Customer churn in utilities

A predictive model using Spatial Data Science

In this notebook a predictive model is built in order to identify what clients are likely to churn in the following month.

The data used for this model is the already processed file: finaldataframe[date]_[version].csv which is built with edp_churn_data_pipeline_revised.ipynb using as sources:

  • Customer historical data
  • Sociodemographics data (Source: Michael Bauer)

Previous feature engineering is performed to include spatial features to our model describing the churners behavior in a client's neighborhood. Some examples of these features are:

  • Churn rate & trend in the last 6 months in the client's zip code
  • Lifetime coefficient of the client compared to churners in the same zip code
  • Expected annual consumption coefficient of the client compared to churners in the same zip code

This notebook shows the main steps followed when training the model. It is organized by iterations. This notebook contains the iterations that contributed the most to the model's performance improvement:

  • Iteration 0: Baseline model

    A baseline model in built considering only features created using customer profile data and a XGBClassifier algorithm with default hyperparameters.

  • Iteration 1: Weights to observations and bad quality predictors removed

    Recent churn events are given more weight than older ones. In addition, features with a strong time dependence are either removed or detrended.

  • Iteration 2: Introduction of spatial features + Recursive Feature Elimination

    Spatial features are included into the model training.

    We apply advanced techniques to identify the 25 monst impactfull features. This is done with two goals:

    • Get explainable results from a business point of view (too many features make it harder)
    • Reduce noise
  • Iteration 3: Algorithm optimization

    Hyperparameter tuning to find the optimal ones. Results obtained are explained in further detail.

In addition, two spatial analysis are performed to explain the differences in the model's performance throughout Portugal identified thanks to results visualization with CARTOFrames:

  • How performance is correlated with a zip code's overall churn rate.
  • Special case: Algarve

0. Setup

In [1]:
#!pip install cartoframes==1.0b2
#!pip install geopandas
#!pip install pysal==1.14.4  # versions >= 2.0.0 won't work in Colab
#!pip install shap
In [2]:
import geopandas
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
pd.set_option('display.max_columns', 500)
import pysal
import seaborn as sns
import shap

from cartoframes.auth import Credentials, set_default_credentials
from cartoframes.client import SQLClient
from cartoframes.data import Dataset
from cartoframes.viz import Map, Layer, basemaps
from cartoframes.viz.helpers import color_bins_layer
from google.colab import drive
drive.mount('/gdrive')
from inspect import signature
from shapely import wkt
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFE, RFECV
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, auc, average_precision_score,classification_report, confusion_matrix, f1_score, precision_recall_curve, \
precision_score, recall_score, roc_curve
from sklearn.model_selection import cross_val_score, KFold, train_test_split
from sklearn.preprocessing import StandardScaler
from xgboost import XGBClassifier
from yellowbrick.model_selection import ValidationCurve


shap.initjs()
sns.set()
/usr/local/lib/python3.6/dist-packages/pysal/__init__.py:65: VisibleDeprecationWarning: PySAL's API will be changed on 2018-12-31. The last release made with this API is version 1.14.4. A preview of the next API version is provided in the `pysalnext` package. The API changes and a guide on how to change imports is provided at https://migrating.pysal.org
  ), VisibleDeprecationWarning)
Go to this URL in a browser: https://accounts.google.com/o/oauth2/auth?client_id=947318989803-6bn6qk8qdgf4n4g3pfee6491hc0brc4i.apps.googleusercontent.com&redirect_uri=urn%3Aietf%3Awg%3Aoauth%3A2.0%3Aoob&scope=email%20https%3A%2F%2Fwww.googleapis.com%2Fauth%2Fdocs.test%20https%3A%2F%2Fwww.googleapis.com%2Fauth%2Fdrive%20https%3A%2F%2Fwww.googleapis.com%2Fauth%2Fdrive.photos.readonly%20https%3A%2F%2Fwww.googleapis.com%2Fauth%2Fpeopleapi.readonly&response_type=code

Enter your authorization code:
··········
Mounted at /gdrive
In [ ]:
my_base_url = 'https://malvarez-carto.carto.com/'
my_api_key=''

cred = Credentials(base_url=my_base_url,
                  api_key=my_api_key)

set_default_credentials(
    base_url=my_base_url,
    api_key=my_api_key
)

sql_client = SQLClient(cred)
In [ ]:
run_date = 201901

1. Load data

  • Preprocessed data
  • zip code geometries to visulaize data
In [16]:
df_final = pd.read_csv('/gdrive/My Drive/final_dataframe_201901_new_MODIFIED.csv')
df_final.head()
Out[16]:
id_cliente_rgpd churn cp_main year_month_start year_month id_conta_rgpd_count id_contrato_rgpd_count e_count g_count s_count eg_count life_time_days lt_global_pop_mean_6m no_active_accounts no_active_contracts no_products consent_data main_postal_region main_urban_type apartamento moradia other_dwellings flg_deb_direto flg_fatura_eletronica flg_edp_online mean_E mean_G median_E median_G min_E min_G max_E max_G start_month flg_cod_produto_key p_t hh_t hh_size age_t0014 age_t1529 age_t3044 age_t4559 age_t60pl pp_ci no_contracts_cp_main no_contracts_rel_cp_main no_clients_cp_main no_clients_rel_cp_main median_life_time_days_cp_main qtd_consumo_anual_esp_cp_main area_cp_main pop_density_cp_main churn_rate_6m churn_trend_6m lt_local_churn_mean_12m lt_local_pop_mean_12m cp_qtd_consumo_anual_esp_E_12m cp_qtd_consumo_anual_esp_G_12m 1m_churn_rate_fullzc 1m_churn_fullzc time_since_last_upgrade time_since_last_downgrade solar
0 852564277059 False 2815 201211 201901 1 2 1 1 0 2 2223.0 1375.99158 1 2 2 0 2 1 2 0 0 1 1 0 2016.0 4368.0 2016.0 4368.0 2016.0 4368.0 2016.0 4368.0 11 1 15917.0 6061.0 2.626134 2500.0 2286.0 3365.0 3481.0 4285.0 117.480025 7891.0 0.495759 4868.0 0.305837 724.5 1884.0 6.350020 2506.606282 0.055653 -0.011195 978.121951 1301.637626 1860.0 1740.0 0.000000 0.0 2218.0 8868.0 0
1 357752500269 False 6000 201410 201901 1 1 1 0 0 1 1532.0 1375.99158 1 1 1 1 6 0 0 0 1 1 1 0 552.0 0.0 552.0 0.0 552.0 0.0 552.0 0.0 10 1 42698.0 18806.0 2.270446 5205.0 6601.0 8880.0 9331.0 12681.0 104.689954 34201.0 0.800998 20453.0 0.479015 782.0 1632.0 1134.547781 37.634378 0.039662 -0.008541 1142.775281 1426.329778 1800.0 1392.0 0.000000 0.0 9000.0 8868.0 0
2 653199189182 True 1600 201405 201901 2 2 2 0 0 2 1679.0 1375.99158 2 2 1 0 1 0 1 1 0 0 0 1 2856.0 0.0 2856.0 0.0 1500.0 0.0 4212.0 0.0 5 1 55701.0 23868.0 2.333710 8628.0 9204.0 11400.0 11779.0 14690.0 205.384118 41257.0 0.740687 21955.0 0.394158 731.0 1872.0 10.579398 5265.044268 0.055815 -0.001712 1072.253780 1362.235793 1890.0 1848.0 0.011765 1.0 9000.0 8868.0 0
3 224790727359 False 4465 201701 201901 1 3 1 0 2 1 720.0 1375.99158 1 3 3 0 4 1 1 0 0 0 0 1 5124.0 0.0 5124.0 0.0 5124.0 0.0 5124.0 0.0 1 1 40507.0 16682.0 2.428186 5034.0 6367.0 8288.0 9431.0 11387.0 113.125676 27394.0 0.676278 15654.0 0.386452 756.0 2100.0 14.435399 2806.088102 0.057909 -0.049516 1070.450408 1325.711287 2292.0 1860.0 0.066667 1.0 9000.0 8868.0 0
4 528190810480 False 2430 201805 201901 2 2 1 0 1 1 220.0 1375.99158 1 1 2 2 2 0 1 0 0 1 0 1 1992.0 0.0 1992.0 0.0 1992.0 0.0 1992.0 0.0 5 1 37054.0 15396.0 2.406729 4976.0 5527.0 8185.0 8108.0 10258.0 98.981843 26820.0 0.723808 14741.0 0.397825 742.5 1800.0 183.327912 202.118705 0.058346 -0.023941 1018.542857 1266.463296 1872.0 1866.0 0.023077 3.0 9000.0 8868.0 0
In [ ]:
#### SOCIODEMOG
query = """
SELECT 
    *,
    ST_AsText(the_geom) as geometry
FROM edp_cp_demographics_agg
"""
In [18]:
df_demog = Dataset(query).download()
df_demog['postcode'] = df_demog['postcode'].astype(int)
demog = geopandas.GeoDataFrame(df_demog)
demog['geometry'] = demog['geometry'].apply(wkt.loads)
demog.crs = {'init': 'EPSG:4326'}
demog.head(2)
Out[18]:
the_geom ctrycode postcode name p_t p_prm hh_t hh_size male female age_t0014 age_m0014 age_f0014 age_t1529 age_m1529 age_f1529 age_t3044 age_m3044 age_f3044 age_t4559 age_m4559 age_f4559 age_t60pl age_m60pl age_f60pl pp_mio pp_prm pp_euro pp_ci no_contracts no_contracts_rel no_contracts_e no_contracts_e_rel no_contracts_g no_contracts_g_rel no_contracts_s no_contracts_s_rel no_clients no_clients_rel no_big_clients no_big_clients_rel no_churn_clients no_churn_clients_rel median_life_time_days qtd_consumo_anual_esp solar_radiation area pop_density flg_deb_direto flg_fatura_eletronica flg_edp_online no_appart no_houses avg_annual_prec median_life_time_days_all avg_annual_temp avg_annual_vapr pvout no_solar_installations no_clients_solar no_clients_solar_rel_clients no_clients_solar_rel_population geometry
cartodb_id
1 0106000020E61000000300000001030000000100000017... PT 2520 Peniche 16474.0 1.605416 6891.0 2.390654 7883.0 8591.0 2078.0 1093.0 985.0 2743.0 1400.0 1343.0 3163.0 1560.0 1603.0 3785.0 1773.0 2012.0 4705.0 2057.0 2648.0 165.137475 1.332970 10024.127397 83.029574 12309.0 0.747177 10579.0 0.642163 193.0 0.011715 1537.0 0.093299 8836.0 0.536360 40.0 0.002428 1170.0 0.132398 860.0 1404.0 4.361516 21.779561 756.397240 4836.0 3365.0 1963.0 6652.0 3168.0 54.250000 1626.0 15.260999 1.382233 4.066355 14.0 14.0 0.001584 0.000850 (POLYGON ((-9.34110076999997 39.3798933910001,...
2 0106000020E61000000100000001030000000100000071... PT 2525 Atouguia da Baleia 10045.0 0.978900 3922.0 2.561193 4921.0 5124.0 1493.0 761.0 732.0 1542.0 779.0 763.0 2275.0 1134.0 1141.0 2023.0 1002.0 1021.0 2712.0 1245.0 1467.0 103.401660 0.834646 10293.843745 85.263627 7447.0 0.741364 6703.0 0.667297 1.0 0.000100 743.0 0.073967 5843.0 0.581682 29.0 0.002887 535.0 0.091547 1010.0 1428.0 4.410309 57.459145 174.819865 3788.0 2390.0 1454.0 2898.0 3272.0 54.972222 1701.5 15.387333 1.371933 4.120259 24.0 24.0 0.004107 0.002389 (POLYGON ((-9.275303377999929 39.366410439, -9...
In [19]:
## Residual table
relevant_items = ['the_geom', 'geometry', 'postcode', 'name', 'no_churn_clients', 'no_churn_clients_rel']
residuals = demog[relevant_items]
ds = Dataset(residuals)
ds.upload(table_name='edp_churn_residuals', if_exists='replace', credentials=cred)
/usr/local/lib/python3.6/dist-packages/carto/sql.py:229: UserWarning: Batch SQL job created with job_id: ad33dd6d-c70a-4269-b08b-d0f4297a3ac6
  warnings.warn('Batch SQL job created with job_id: {job_id}'.format(job_id=data['job_id']))
Out[19]:
<cartoframes.data.dataset.Dataset at 0x7f2270e87a58>
In [ ]:
residual_query = """
SELECT 
    *,
    ST_AsText(the_geom) as geometry
FROM edp_churn_residuals
"""
In [21]:
# In order to characterize post codes better, we use the historical churn_rate
df_final = df_final.merge(df_demog[['postcode', 'no_churn_clients_rel']], how='left', left_on='cp_main', right_on='postcode').drop('postcode', axis=1)
df_final.head(2)
Out[21]:
id_cliente_rgpd churn cp_main year_month_start year_month id_conta_rgpd_count id_contrato_rgpd_count e_count g_count s_count eg_count life_time_days lt_global_pop_mean_6m no_active_accounts no_active_contracts no_products consent_data main_postal_region main_urban_type apartamento moradia other_dwellings flg_deb_direto flg_fatura_eletronica flg_edp_online mean_E mean_G median_E median_G min_E min_G max_E max_G start_month flg_cod_produto_key p_t hh_t hh_size age_t0014 age_t1529 age_t3044 age_t4559 age_t60pl pp_ci no_contracts_cp_main no_contracts_rel_cp_main no_clients_cp_main no_clients_rel_cp_main median_life_time_days_cp_main qtd_consumo_anual_esp_cp_main area_cp_main pop_density_cp_main churn_rate_6m churn_trend_6m lt_local_churn_mean_12m lt_local_pop_mean_12m cp_qtd_consumo_anual_esp_E_12m cp_qtd_consumo_anual_esp_G_12m 1m_churn_rate_fullzc 1m_churn_fullzc time_since_last_upgrade time_since_last_downgrade solar no_churn_clients_rel
0 852564277059 False 2815 201211 201901 1 2 1 1 0 2 2223.0 1375.99158 1 2 2 0 2 1 2 0 0 1 1 0 2016.0 4368.0 2016.0 4368.0 2016.0 4368.0 2016.0 4368.0 11 1 15917.0 6061.0 2.626134 2500.0 2286.0 3365.0 3481.0 4285.0 117.480025 7891.0 0.495759 4868.0 0.305837 724.5 1884.0 6.350020 2506.606282 0.055653 -0.011195 978.121951 1301.637626 1860.0 1740.0 0.0 0.0 2218.0 8868.0 0 0.168412
1 357752500269 False 6000 201410 201901 1 1 1 0 0 1 1532.0 1375.99158 1 1 1 1 6 0 0 0 1 1 1 0 552.0 0.0 552.0 0.0 552.0 0.0 552.0 0.0 10 1 42698.0 18806.0 2.270446 5205.0 6601.0 8880.0 9331.0 12681.0 104.689954 34201.0 0.800998 20453.0 0.479015 782.0 1632.0 1134.547781 37.634378 0.039662 -0.008541 1142.775281 1426.329778 1800.0 1392.0 0.0 0.0 9000.0 8868.0 0 0.115332

2. Data transformation and correlation matrix

2.1 Relative amount calculation

Here we normalize:

  • Variables relative to accounts, contracts, and dwellings wrt the total number of each of these.
  • Demographic variables wrt the zip code's population.
  • Lifetime and consumption variables are normalized wrt to temporal and spatio-temporal amounts.
In [ ]:
df_final['active_accounts_rel'] = df_final['no_active_accounts'] / df_final['id_conta_rgpd_count']
df_final['active_contracts_rel'] = df_final['no_active_contracts'] / df_final['id_contrato_rgpd_count']
df_final['apartamento_rel'] = df_final['apartamento'] / (df_final['apartamento'] + df_final['moradia'] + df_final['other_dwellings'])
df_final['moradia_rel'] = df_final['moradia'] / (df_final['apartamento'] + df_final['moradia'] + df_final['other_dwellings'])
df_final['other_dwellings_rel'] = df_final['other_dwellings'] / (df_final['apartamento'] + df_final['moradia'] + df_final['other_dwellings'])
In [ ]:
# Watch out! There are clients registered in zip codes not included in the sociodem data. For now, we filter them.
df_final = df_final[df_final['p_t'] > 0]
df_final['age_t0014'] = df_final['age_t0014'] / df_final['p_t']
df_final['age_t1529'] = df_final['age_t1529'] / df_final['p_t']
df_final['age_t3044'] = df_final['age_t3044'] / df_final['p_t']
df_final['age_t4559'] = df_final['age_t4559'] / df_final['p_t']
df_final['age_t60pl'] = df_final['age_t60pl'] / df_final['p_t']
In [ ]:
df_final['lifetime_global_coeff'] = df_final['life_time_days'] / (df_final['lt_global_pop_mean_6m']+1)
df_final['lifetime_local_churn_coeff'] = df_final['life_time_days'] / (df_final['lt_local_churn_mean_12m']+1)
#df_final['lifetime_local_coeff'] = df_final['life_time_days'] / (df_final['lt_local_pop_mean_12m']+1)
df_final['lifetime_coeff'] = df_final['life_time_days'] / (df_final['median_life_time_days_cp_main']+1)
df_final['qtd_consumo_anual_E_12m_coeff'] = df_final['median_E'] / (df_final['cp_qtd_consumo_anual_esp_E_12m']+1)
df_final['qtd_consumo_anual_G_12m_coeff'] = df_final['median_G'] / (df_final['cp_qtd_consumo_anual_esp_G_12m']+1)
df_final['qtd_consumo_anual_coeff'] = (df_final['mean_E'] * df_final['e_count'] + df_final['mean_G'] * df_final['g_count']) /\
  (df_final['qtd_consumo_anual_esp_cp_main']+1)

2.2 Value reduction

Some continuous (integer) variables have very large values. We'll assign them the next value of their 0.98 quantile.

In [25]:
interest_feats = ['id_conta_rgpd_count', 'id_contrato_rgpd_count', 'e_count', 'g_count', 's_count', 'eg_count', 
                  'no_active_accounts', 'no_active_contracts', 'no_products', 'apartamento', 'moradia', 'other_dwellings']

quantile = 0.98
for feature in interest_feats:
  quant_val = df_final[feature].quantile(quantile)
  print(feature, quant_val)
  df_final.loc[df_final[feature] > quant_val + 1, feature] = quant_val + 1
id_conta_rgpd_count 3.0
id_contrato_rgpd_count 5.0
e_count 3.0
g_count 1.0
s_count 2.0
eg_count 4.0
no_active_accounts 3.0
no_active_contracts 4.0
no_products 4.0
apartamento 3.0
moradia 2.0
other_dwellings 1.0

2.3 Var classification

We classify variables by their type for further treatment.

In [ ]:
#Target column
target_var = 'churn'
#categorical columns
cat_vars   = ['consent_data', 'main_postal_region', 'main_urban_type', 'start_month']
#Binary columns with 2 values
bin_vars   = ['consent_data', 'flg_deb_direto	', 'flg_fatura_eletronica', 'flg_edp_online', 'flg_cod_produto_key', 'solar']
#numerical columns
num_vars  = [x for x in df_final.loc[:, 'id_conta_rgpd_count':].columns.values.tolist() if x not in cat_vars + bin_vars]
In [27]:
df_final[num_vars].describe()
Out[27]:
id_conta_rgpd_count id_contrato_rgpd_count e_count g_count s_count eg_count life_time_days lt_global_pop_mean_6m no_active_accounts no_active_contracts no_products apartamento moradia other_dwellings flg_deb_direto mean_E mean_G median_E median_G min_E min_G max_E max_G p_t hh_t hh_size age_t0014 age_t1529 age_t3044 age_t4559 age_t60pl pp_ci no_contracts_cp_main no_contracts_rel_cp_main no_clients_cp_main no_clients_rel_cp_main median_life_time_days_cp_main qtd_consumo_anual_esp_cp_main area_cp_main pop_density_cp_main churn_rate_6m churn_trend_6m lt_local_churn_mean_12m lt_local_pop_mean_12m cp_qtd_consumo_anual_esp_E_12m cp_qtd_consumo_anual_esp_G_12m 1m_churn_rate_fullzc 1m_churn_fullzc time_since_last_upgrade time_since_last_downgrade no_churn_clients_rel active_accounts_rel active_contracts_rel apartamento_rel moradia_rel other_dwellings_rel lifetime_global_coeff lifetime_local_churn_coeff lifetime_coeff qtd_consumo_anual_E_12m_coeff qtd_consumo_anual_G_12m_coeff qtd_consumo_anual_coeff
count 3.490372e+06 3.490372e+06 3.490372e+06 3.490372e+06 3.490372e+06 3.490372e+06 3.490372e+06 3.490372e+06 3.490372e+06 3.490372e+06 3.490372e+06 3.490372e+06 3.490372e+06 3.490372e+06 3.490372e+06 3.490372e+06 3.490372e+06 3.490372e+06 3.490372e+06 3.490372e+06 3.490372e+06 3.490372e+06 3.490372e+06 3.490372e+06 3.490372e+06 3.490372e+06 3.490372e+06 3.490372e+06 3.490372e+06 3.490372e+06 3.490372e+06 3.490372e+06 3.490372e+06 3.490372e+06 3.490372e+06 3.490372e+06 3.490372e+06 3.490372e+06 3.490372e+06 3.490372e+06 3.490372e+06 3.490372e+06 3.490372e+06 3.490372e+06 3.490372e+06 3.490372e+06 3.490372e+06 3.490372e+06 3.490372e+06 3.490372e+06 3.490372e+06 3.490372e+06 3.490372e+06 3.490372e+06 3.490372e+06 3.490372e+06 3.490372e+06 3.490372e+06 3.490372e+06 3.490372e+06 3.490372e+06 3.490372e+06
mean 1.266406e+00 1.753750e+00 1.215042e+00 2.196296e-01 3.097567e-01 1.438643e+00 1.348331e+03 1.317661e+03 1.197519e+00 1.607505e+00 1.524701e+00 8.005313e-01 5.654818e-01 6.648661e-02 5.283302e-01 2.379605e+03 5.427396e+02 2.353492e+03 5.420576e+02 2.137781e+03 5.292069e+02 2.666522e+03 5.577051e+02 3.297545e+04 1.318876e+04 2.494954e+00 1.354596e-01 1.564561e-01 2.032008e-01 2.162688e-01 2.886148e-01 1.008168e+02 2.289733e+04 6.970796e-01 1.347451e+04 4.271462e-01 7.964865e+02 1.781818e+03 1.573903e+02 1.760208e+03 4.895573e-02 -5.738035e-03 1.026754e+03 1.255861e+03 1.909326e+03 1.379149e+03 8.780543e-03 7.846616e-01 6.899536e+03 8.457418e+03 1.436758e-01 9.697168e-01 9.534934e-01 5.180932e-01 4.448754e-01 3.703146e-02 1.018747e+00 1.419117e+00 1.719982e+00 3.598637e+00 1.253576e+01 2.009658e+00
std 5.984810e-01 1.049595e+00 5.690594e-01 4.337184e-01 6.423427e-01 7.401393e-01 7.795010e+02 1.798664e+02 5.220919e-01 8.775591e-01 7.888824e-01 8.574640e-01 6.454461e-01 2.891886e-01 4.991968e-01 3.175358e+03 8.570929e+03 3.156990e+03 8.567592e+03 3.041743e+03 8.490148e+03 3.859251e+03 8.881134e+03 1.791177e+04 7.143032e+03 2.227267e-01 2.533412e-02 1.844800e-02 2.965808e-02 1.722976e-02 6.814190e-02 2.726138e+01 1.300506e+04 1.515983e-01 6.995969e+03 9.684319e-02 1.259438e+02 3.343942e+02 2.260144e+02 2.742860e+03 1.552821e-02 1.024011e-01 1.670075e+02 1.910866e+02 3.865450e+02 8.756120e+02 2.020414e-02 1.464084e+00 3.547055e+03 1.780313e+03 3.760363e-02 1.195740e-01 1.411238e-01 4.802307e-01 4.774159e-01 1.667547e-01 5.840472e-01 9.974963e+00 2.121773e+00 1.363147e+02 1.285303e+03 6.714186e+00
min 1.000000e+00 1.000000e+00 0.000000e+00 0.000000e+00 0.000000e+00 1.000000e+00 0.000000e+00 9.666667e+00 1.000000e+00 1.000000e+00 1.000000e+00 0.000000e+00 0.000000e+00 0.000000e+00 0.000000e+00 -1.509600e+04 -8.748000e+04 -1.509600e+04 -8.748000e+04 -2.559600e+04 -8.748000e+04 -1.509600e+04 -8.748000e+04 6.700000e+02 3.310000e+02 1.948212e+00 4.543815e-02 5.907473e-02 3.582090e-02 1.343350e-01 1.107727e-01 4.527288e+01 1.000000e+00 1.522765e-04 1.000000e+00 1.142074e-04 0.000000e+00 0.000000e+00 6.896414e-01 3.851310e+00 0.000000e+00 -9.980040e-01 0.000000e+00 5.562500e+00 0.000000e+00 0.000000e+00 0.000000e+00 0.000000e+00 -3.100000e+01 0.000000e+00 0.000000e+00 4.484305e-03 4.347826e-03 0.000000e+00 0.000000e+00 0.000000e+00 0.000000e+00 0.000000e+00 0.000000e+00 -8.746604e+00 -1.202400e+04 -3.908698e+01
25% 1.000000e+00 1.000000e+00 1.000000e+00 0.000000e+00 0.000000e+00 1.000000e+00 7.517500e+02 1.375992e+03 1.000000e+00 1.000000e+00 1.000000e+00 0.000000e+00 0.000000e+00 0.000000e+00 0.000000e+00 1.020000e+03 0.000000e+00 9.960000e+02 0.000000e+00 7.320000e+02 0.000000e+00 1.128000e+03 0.000000e+00 1.877500e+04 7.614000e+03 2.366248e+00 1.199585e-01 1.453244e-01 1.877135e-01 2.038903e-01 2.429443e-01 8.229634e+01 1.214400e+04 5.958063e-01 7.858000e+03 3.642241e-01 7.150000e+02 1.584000e+03 2.081229e+01 1.189517e+02 3.774433e-02 -4.161730e-02 9.588208e+02 1.234294e+03 1.704000e+03 0.000000e+00 0.000000e+00 0.000000e+00 2.150000e+03 8.868000e+03 1.158059e-01 1.000000e+00 1.000000e+00 0.000000e+00 0.000000e+00 0.000000e+00 5.984060e-01 7.706724e-01 9.597270e-01 5.355151e-01 0.000000e+00 7.402071e-01
50% 1.000000e+00 1.000000e+00 1.000000e+00 0.000000e+00 0.000000e+00 1.000000e+00 1.410000e+03 1.375992e+03 1.000000e+00 1.000000e+00 1.000000e+00 1.000000e+00 0.000000e+00 0.000000e+00 1.000000e+00 1.848000e+03 0.000000e+00 1.836000e+03 0.000000e+00 1.704000e+03 0.000000e+00 2.016000e+03 0.000000e+00 3.210400e+04 1.247600e+04 2.465137e+00 1.344473e-01 1.564339e-01 2.073324e-01 2.172829e-01 2.721929e-01 9.572864e+01 2.186900e+04 6.762782e-01 1.331300e+04 4.026341e-01 7.830000e+02 1.788000e+03 7.496094e+01 3.911485e+02 5.072091e-02 -9.894375e-03 1.048983e+03 1.287974e+03 1.836000e+03 1.704000e+03 0.000000e+00 0.000000e+00 9.000000e+03 8.868000e+03 1.461765e-01 1.000000e+00 1.000000e+00 5.000000e-01 0.000000e+00 0.000000e+00 1.044306e+00 1.328421e+00 1.711143e+00 9.782374e-01 0.000000e+00 1.375070e+00
75% 1.000000e+00 2.000000e+00 1.000000e+00 0.000000e+00 0.000000e+00 2.000000e+00 1.818000e+03 1.375992e+03 1.000000e+00 2.000000e+00 2.000000e+00 1.000000e+00 1.000000e+00 0.000000e+00 1.000000e+00 2.988000e+03 0.000000e+00 2.976000e+03 0.000000e+00 2.844000e+03 0.000000e+00 3.228000e+03 0.000000e+00 4.585000e+04 1.746100e+04 2.615465e+00 1.513868e-01 1.667313e-01 2.224336e-01 2.303967e-01 3.280360e-01 1.110335e+02 3.062300e+04 7.723241e-01 1.776900e+04 4.747475e-01 8.600000e+02 1.992000e+03 1.870293e+02 2.386383e+03 6.102077e-02 2.110960e-02 1.123808e+03 1.348261e+03 2.100000e+03 1.944000e+03 1.142857e-02 1.000000e+00 9.000000e+03 8.868000e+03 1.761480e-01 1.000000e+00 1.000000e+00 1.000000e+00 1.000000e+00 0.000000e+00 1.340604e+00 1.736371e+00 2.300131e+00 1.560183e+00 0.000000e+00 2.324928e+00
max 4.000000e+00 6.000000e+00 4.000000e+00 2.000000e+00 3.000000e+00 5.000000e+00 4.500000e+03 1.375992e+03 4.000000e+00 5.000000e+00 5.000000e+00 4.000000e+00 3.000000e+00 2.000000e+00 1.000000e+00 3.155460e+06 6.866376e+06 3.155460e+06 6.866376e+06 3.155460e+06 6.866376e+06 3.155460e+06 6.866376e+06 8.103500e+04 3.210300e+04 3.501466e+00 2.200777e-01 2.478301e-01 3.269988e-01 2.560976e-01 6.671642e-01 2.226849e+02 5.626000e+04 1.349254e+00 3.237100e+04 1.161194e+00 2.359000e+03 3.132000e+03 1.441193e+03 1.473122e+04 5.000000e-01 1.000000e+01 2.921000e+03 1.621221e+03 1.853580e+05 8.760000e+03 1.000000e+00 3.900000e+01 9.000000e+03 8.868000e+03 3.333333e-01 1.000000e+00 1.000000e+00 1.000000e+00 1.000000e+00 1.000000e+00 5.699449e+00 2.921000e+03 2.050000e+03 8.307600e+04 1.489200e+06 3.814574e+03

2.4 Outliers and errors

We identified some negative values, and some very extreme ones in expected consumption.

If we take a closer look at extreme values, we can see they are especially highly concentrated in three regions (see map below), but for less than 5% of the zip codes extreme account for more than 1% of the customers there.

Given this, we decided to discard:

  • Clients with a negative expected consumption.
  • 0.5% of extreme (high) values.
In [28]:
metric = 'mean_E'
plt.figure(figsize=(12,5))
sns.distplot(df_final.loc[df_final[metric] < df_final[metric].quantile(0.995), metric])
Out[28]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f2271d9fac8>

Let's see if they are greographically concentrated.

In [29]:
condition = df_final[metric] >= df_final[metric].quantile(0.995)
#condition = df_final[metric] < 0

outlier_concentration = demog[['geometry', 'postcode', 'name', 'no_clients']].\
  merge(df_final[condition].\
          groupby('cp_main').agg({'id_cliente_rgpd':'count'}).\
          reset_index().rename(columns={'id_cliente_rgpd':'outlier_count'}), 
        how='left', left_on='postcode', right_on='cp_main').fillna(0).drop('cp_main', axis=1)

outlier_concentration['outlier_count_rel'] = outlier_concentration['outlier_count'] / (outlier_concentration['no_clients']+1)

ds = Dataset(outlier_concentration)
ds.upload(table_name='edp_outlier_concentration', if_exists='replace', credentials=cred)
/usr/local/lib/python3.6/dist-packages/carto/sql.py:229: UserWarning: Batch SQL job created with job_id: e96b5ee7-dcb1-44d8-a0e1-fe8d5bd0e93d
  warnings.warn('Batch SQL job created with job_id: {job_id}'.format(job_id=data['job_id']))
Out[29]:
<cartoframes.data.dataset.Dataset at 0x7f226ae8d2b0>
In [30]:
category_attribute = 'outlier_count_rel' # outlier_count
Map(
    color_bins_layer('edp_outlier_concentration', category_attribute, 'Outlier count')
)
Out[30]:

Filtering is applied both based on electricity and gas consumption:

In [ ]:
df_final = df_final[(df_final['min_E'] >= 0) & (df_final['max_E'] < df_final['max_E'].quantile(0.995))]
df_final = df_final[(df_final['min_G'] >= 0) & (df_final['max_G'] < df_final['max_G'].quantile(0.995))]

2.5 Categorical one hot encoding

In [32]:
one_hot_cat = pd.get_dummies(data = df_final[cat_vars],columns = cat_vars)
one_hot_cat.head(2)
Out[32]:
consent_data_0 consent_data_1 consent_data_2 main_postal_region_1 main_postal_region_2 main_postal_region_3 main_postal_region_4 main_postal_region_5 main_postal_region_6 main_postal_region_7 main_postal_region_8 main_postal_region_9 main_urban_type_0 main_urban_type_1 start_month_1 start_month_2 start_month_3 start_month_4 start_month_5 start_month_6 start_month_7 start_month_8 start_month_9 start_month_10 start_month_11 start_month_12
0 1 0 0 0 1 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 1 0
1 0 1 0 0 0 0 0 0 1 0 0 0 1 0 0 0 0 0 0 0 0 0 0 1 0 0
In [33]:
df_final = pd.concat([df_final, one_hot_cat], axis=1)
df_final.head(2)
Out[33]:
id_cliente_rgpd churn cp_main year_month_start year_month id_conta_rgpd_count id_contrato_rgpd_count e_count g_count s_count eg_count life_time_days lt_global_pop_mean_6m no_active_accounts no_active_contracts no_products consent_data main_postal_region main_urban_type apartamento moradia other_dwellings flg_deb_direto flg_fatura_eletronica flg_edp_online mean_E mean_G median_E median_G min_E min_G max_E max_G start_month flg_cod_produto_key p_t hh_t hh_size age_t0014 age_t1529 age_t3044 age_t4559 age_t60pl pp_ci no_contracts_cp_main no_contracts_rel_cp_main no_clients_cp_main no_clients_rel_cp_main median_life_time_days_cp_main qtd_consumo_anual_esp_cp_main area_cp_main pop_density_cp_main churn_rate_6m churn_trend_6m lt_local_churn_mean_12m lt_local_pop_mean_12m cp_qtd_consumo_anual_esp_E_12m cp_qtd_consumo_anual_esp_G_12m 1m_churn_rate_fullzc 1m_churn_fullzc time_since_last_upgrade time_since_last_downgrade solar no_churn_clients_rel active_accounts_rel active_contracts_rel apartamento_rel moradia_rel other_dwellings_rel lifetime_global_coeff lifetime_local_churn_coeff lifetime_coeff qtd_consumo_anual_E_12m_coeff qtd_consumo_anual_G_12m_coeff qtd_consumo_anual_coeff consent_data_0 consent_data_1 consent_data_2 main_postal_region_1 main_postal_region_2 main_postal_region_3 main_postal_region_4 main_postal_region_5 main_postal_region_6 main_postal_region_7 main_postal_region_8 main_postal_region_9 main_urban_type_0 main_urban_type_1 start_month_1 start_month_2 start_month_3 start_month_4 start_month_5 start_month_6 start_month_7 start_month_8 start_month_9 start_month_10 start_month_11 start_month_12
0 852564277059 False 2815 201211 201901 1.0 2.0 1.0 1.0 0.0 2.0 2223.0 1375.99158 1.0 2.0 2.0 0 2 1 2.0 0.0 0.0 1 1 0 2016.0 4368.0 2016.0 4368.0 2016.0 4368.0 2016.0 4368.0 11 1 15917.0 6061.0 2.626134 0.157065 0.143620 0.211409 0.218697 0.269209 117.480025 7891.0 0.495759 4868.0 0.305837 724.5 1884.0 6.350020 2506.606282 0.055653 -0.011195 978.121951 1301.637626 1860.0 1740.0 0.0 0.0 2218.0 8868.0 0 0.168412 1.0 1.0 1.0 0.0 0.0 1.614389 2.270402 3.064094 1.083289 2.508903 3.386737 1 0 0 0 1 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 1 0
1 357752500269 False 6000 201410 201901 1.0 1.0 1.0 0.0 0.0 1.0 1532.0 1375.99158 1.0 1.0 1.0 1 6 0 0.0 0.0 1.0 1 1 0 552.0 0.0 552.0 0.0 552.0 0.0 552.0 0.0 10 1 42698.0 18806.0 2.270446 0.121903 0.154597 0.207972 0.218535 0.296993 104.689954 34201.0 0.800998 20453.0 0.479015 782.0 1632.0 1134.547781 37.634378 0.039662 -0.008541 1142.775281 1426.329778 1800.0 1392.0 0.0 0.0 9000.0 8868.0 0 0.115332 1.0 1.0 0.0 0.0 1.0 1.112570 1.339424 1.956577 0.306496 0.000000 0.338028 0 1 0 0 0 0 0 0 1 0 0 0 1 0 0 0 0 0 0 0 0 0 0 1 0 0

2.6 Correlation matrix

We take a first look at the correlation matrix. First, considering all historical data, afterwards only considering churn happening after a certain date, and finally considering clients joining the company after a certain date.

We can see churn is weakly correlated to most of the features except for those related to life time moving average at different time and spatial levels. This shows we have to be extremely careful with features regarding life time as it is strongly time-dependent.

Zooming on high correlation areas of the corr matrix, we can identify some highly correlated variables. We should be careful with these variables so that they don't introduce noise to the model we train.

In [ ]:
non_spatial_features = ['id_conta_rgpd_count', 'id_contrato_rgpd_count', 'e_count', 'g_count',
                       's_count', 'eg_count', 'life_time_days', 'lifetime_global_coeff',
                       'no_active_accounts', 'no_active_contracts', 'no_products',
                       'consent_data', 'apartamento',
                       'moradia', 'other_dwellings', 'flg_deb_direto', 'flg_fatura_eletronica',
                       'flg_edp_online', 'mean_E', 'mean_G', 'median_E', 'median_G', 'min_E',
                       'min_G', 'max_E', 'max_G', 'start_month', 'flg_cod_produto_key', 
                       'time_since_last_upgrade',
                       'time_since_last_downgrade', 'solar', 'active_accounts_rel',
                       'active_contracts_rel', 'apartamento_rel', 'moradia_rel',
                       'other_dwellings_rel']
In [35]:
plot_corr_matrix(df_final.sample(100000), ['churn'] + non_spatial_features, 'corr - non-spatial features')
In [36]:
item_list = ['churn'] + df_final.columns[5:75].tolist()
plot_corr_matrix(df_final.sample(100000), item_list, 'corr')

Let's now take a look considering only customers who churned after 201701 or are still active.

In [37]:
plot_corr_matrix(df_final[df_final['year_month'] > 201701].sample(35000), item_list, 'corr 201701 - ')

Churn is highly correlated with featres regarding time-window life time values and with the feature flg_cod_produto_key which identifies if the client has an E/G product different from "Electricidade"/"Gas".

We must be extremely careful with these features because as we saw in our first analysis (summarized in edp_data_summary.ipynp) all of them depend heavily on time: lifetime increases continuously over time, and there are no new clients with E/G contracts different from "Electricity"/"Gas".

In [38]:
# Churn's strongest correlations
for item in item_list:
  corr_coef = np.corrcoef(df_final.loc[df_final['year_month_start'] > 201201, 'churn'], 
                          df_final.loc[df_final['year_month_start'] > 201201, item])[0, 1]
  if np.abs(corr_coef) > 0.4:
    print(item, corr_coef)
churn 0.9999999999999999
lt_global_pop_mean_6m -0.8018043980024551
flg_cod_produto_key -0.7067193080855573
lt_local_churn_mean_12m -0.6108097601704262
lt_local_pop_mean_12m -0.7278473628172

Let's zoom in over those areas of the corr matrix with higher correlation values.

Summary:

  • We should only take one E, and one G consumption measurement.
  • Wrt number of contracts, there are also very highly correlated variables. We should be careful with these too.
  • Demographics variables also show some high correlations.
In [39]:
item_list = ['id_conta_rgpd_count', 'id_contrato_rgpd_count', 'e_count', 'g_count',
             's_count', 'eg_count', 'no_active_accounts', 'no_active_contracts', 'no_products',
            'active_accounts_rel', 'active_contracts_rel']
plot_corr_matrix(df_final[df_final['year_month_start'] > 201201].sample(35000), item_list, 'corr')

3. Model building

We'll start working with features obtained from EDP's historical data, and will increasingly add spatial features to measure the improvement in performance.

Class inbalance

There is a 86-14 class imbalance in our target variable. We will balance this variable downsampling non-churners.

Performance metric

Even though throughout the model building process we evaluate different metrics, our focus metric will be recall. This is because not identifying a churner properly and losing the client is more costly than identifying a non-churner as churner. In particular, we especially care about recall in the last month of the prediction (in this case January 2019).

Time dependent variables

On all the analyses done, we have seen churn patterns have changed through time. For example, while in 2016/2017 flg_cod_produto_key was a very good predictor, it is no longer the case. We should be extremely careful with this. That is why we decided to limit the data used to train the model, hence the two parameters below:

In [ ]:
starting_churn_date_cut = 201601
starting_join_date_cut = 201201

Itearion 0

We'll start by training an XGBClassifier model with default hyperparameters, and the features in non_spatial_features.

Summary of results:

  • This initial algorithm has an accuracy of 0.9, which initially looks good.
  • However, when we take a look at recall, it is 0.83 overall with a drop to 0 in January 2019.
  • Model performance is very good identifying churners in 2016 and 2017, but its performance decreases dramatically as it approaches 201901. It is also far worse predicting churn for new joiners. Looking at feature importances, we can see this is due to the fact that flg_cod_produto_key is by far the most influencial predictor, which is true in 2016 and 2017, but not in 2018 onwards.
  • Moran's I spatial autocorrelation index (0.56) shows there is spatial autocorrelation wrt recall.
In [ ]:
client_features = ['id_cliente_rgpd',	'churn', 'cp_main',	'year_month_start',	'year_month']
In [42]:
df_final_f = df_final.loc[(df_final['year_month'] > starting_churn_date_cut) & (df_final['year_month_start'] >= starting_join_date_cut), client_features + non_spatial_features]
sample = df_final_f[~df_final_f['churn']].sample(df_final_f[df_final_f['churn']].shape[0], random_state=111)
sample = pd.concat([sample, df_final_f[df_final_f['churn']]])
sample.shape
Out[42]:
(695988, 41)
In [ ]:
features = non_spatial_features
In [ ]:
X_train, X_test, y_train, y_test = train_test_split(sample.loc[:, features], sample['churn'].astype(int),
                                                    test_size=0.3, random_state=7)
In [45]:
model = XGBClassifier(random_state = 111)
model.fit(X_train, y_train)
Out[45]:
XGBClassifier(base_score=0.5, booster='gbtree', colsample_bylevel=1,
              colsample_bynode=1, colsample_bytree=1, gamma=0,
              learning_rate=0.1, max_delta_step=0, max_depth=3,
              min_child_weight=1, missing=None, n_estimators=100, n_jobs=1,
              nthread=None, objective='binary:logistic', random_state=111,
              reg_alpha=0, reg_lambda=1, scale_pos_weight=1, seed=None,
              silent=None, subsample=1, verbosity=1)

Performance

In [46]:
calculate_performance_metrics(model, sample, X_train, X_test, y_train, y_test, 0, 'recall_0')
Train score: 0.9029620826328894
Test score: 0.9034277312413492
Confusion matrix:
[[101906   2405]
 [ 17759  86727]]
*** METRIC SUMMARY ***
              precision    recall  f1-score   support

           0       0.85      0.98      0.91    104311
           1       0.97      0.83      0.90    104486

    accuracy                           0.90    208797
   macro avg       0.91      0.90      0.90    208797
weighted avg       0.91      0.90      0.90    208797

accuracy 0.9034277312413492
precision 0.9730175470089306
recall 0.8300346457898666
f1 score 0.8958567901744673
ROC_AUC: 0.9688407447710987
*** REDISUAL TIME DEPENDENCY ***
*** RESIDUAL SPATIAL AUTOCORRELATION ***
/usr/local/lib/python3.6/dist-packages/carto/sql.py:229: UserWarning: Batch SQL job created with job_id: 42db26f0-08a8-402b-8c56-c7795128af37
  warnings.warn('Batch SQL job created with job_id: {job_id}'.format(job_id=data['job_id']))
/usr/local/lib/python3.6/dist-packages/pysal/weights/weights.py:189: UserWarning: There are 5 disconnected observations
  warnings.warn("There are %d disconnected observations" % ni)
/usr/local/lib/python3.6/dist-packages/pysal/weights/weights.py:190: UserWarning: Island ids: 126, 198, 199, 356, 387
  warnings.warn("Island ids: %s" % ', '.join(str(island) for island in self.islands))
('WARNING: ', 126, ' is an island (no neighbors)')
('WARNING: ', 198, ' is an island (no neighbors)')
('WARNING: ', 199, ' is an island (no neighbors)')
('WARNING: ', 356, ' is an island (no neighbors)')
('WARNING: ', 387, ' is an island (no neighbors)')
Moran's I: 0.561
Moran's E[I]: -0.002
Moran's Var[I]: 0.001

Feature importace

In [47]:
importances = pd.DataFrame(model.feature_importances_, index=sample.loc[:, features].columns, columns=['importance']).sort_values('importance', ascending=False)
plt.figure(figsize=(16,6))
sns.barplot(x='importance', y='index', data=importances.reset_index()[:25], color='blue')
#plt.title('Feature importance', size=14)
Out[47]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f226ab32400>

Time dependent variables

The figure below shows the big change in flg_cod_produto_key during 2018.

In [48]:
plt.figure(figsize=(14,4))
flg_cod_produto_key_monthly = sample.groupby('year_month').agg({'flg_cod_produto_key':'mean'}).reset_index()
sns.lineplot(x = flg_cod_produto_key_monthly['year_month'].astype(str), y=flg_cod_produto_key_monthly['flg_cod_produto_key'])
plt.title('Average flg_cod_produto_key per churning month', fontsize=15, color='black', weight='medium')
plt.xticks(rotation='vertical', fontsize='x-small')
plt.grid(axis='x')
In [49]:
fig, axs = plt.subplots(1, 2, figsize=(18,4))

lt_monthly = sample[sample['churn']].groupby('year_month').agg({'life_time_days':'mean'}).reset_index()
sns.lineplot(x = lt_monthly['year_month'].astype(str), y=lt_monthly['life_time_days'], ax=axs[0])
axs[0].set_title('Average lifetime per churning month', fontsize=15, color='black', weight='medium')
axs[0].tick_params(labelrotation=90, size=1)
axs[0].grid(axis='x')

lt_coeff_monthly = sample[sample['churn']].groupby('year_month').agg({'lifetime_global_coeff':'mean'}).reset_index()
sns.lineplot(x = lt_coeff_monthly['year_month'].astype(str), y=lt_coeff_monthly['lifetime_global_coeff'], ax=axs[1])
axs[1].set_title('Average lifetime coeff per churning month', fontsize=15, color='black', weight='medium')
axs[1].tick_params(labelrotation=90, size=1)
axs[1].grid(axis='x')

Iteration 1

On this iteration, we'll focus on correcting the time performance time dependency by giving priority to recent observations of churn. We'll do this by sampling a decreasing number of churners as we go further backwards in time.

In addition, we detrend or remove those features with a high time dependence.

Summary of results:

  • Even though accuracy has dropped from 0.9 to 0.72, recall has improved from an overall recall of 0.83 to 0.87.
  • Most importantly, even though performance decreases over time, now it's far more stable with a minimum recall of 0.87 in January 2019.
  • When looking at the metrics based on joining date, we see a strong decrease in performance with customers joining at the end of 2015. This should be analyzed in detail. Some attempts were run to improve this sampling both churners and non churners by joining date but results were not conclusive.
  • Moran's I spatial autocorrelation index has dropped to 0.24 (from 0.56).
  • If we plot the performance by zip code, we can see there are some clusters of zip codes with high performance, but in general there does not seem to be a spatial pattern.
In [50]:
df_final_f = df_final.loc[(df_final['year_month'] > starting_churn_date_cut) & (df_final['year_month_start'] >= starting_join_date_cut), client_features + non_spatial_features]
ending_periods = df_final_f['year_month'].sort_values(ascending=False).unique()
sample_size_f = 12
replace = sample_size_f > 1
sample = df_final_f[(df_final_f['churn']) & (df_final_f['year_month'] == ending_periods[0])].sample(frac=sample_size_f, replace=replace, random_state=111)
for i in range(1, len(ending_periods)):
  sample_size_f /= 2
  if sample_size_f < 0.05:
    sample_size_f = 0.05
  replace = sample_size_f > 1
  sample = pd.concat([sample, df_final_f[(df_final_f['churn']) & (df_final_f['year_month'] == ending_periods[i])].sample(frac=sample_size_f, replace=replace, random_state=111)])

sample = pd.concat([sample, df_final_f[~df_final_f['churn']].sample(sample.shape[0], random_state=111)])

sample.shape
Out[50]:
(690266, 41)
In [ ]:
features = ['id_conta_rgpd_count', 'id_contrato_rgpd_count', 'e_count', 'g_count',
           's_count', 'eg_count', 'lifetime_global_coeff',
           'no_active_accounts', 'no_active_contracts', 'no_products',
           'consent_data', 'apartamento',
           'moradia', 'other_dwellings', 'flg_deb_direto', 'flg_fatura_eletronica',
           'flg_edp_online', 'mean_E', 'mean_G', 'start_month', 
           'time_since_last_upgrade',
           'time_since_last_downgrade', 'solar',
           'active_contracts_rel', 'apartamento_rel', 'moradia_rel',
           'other_dwellings_rel']
In [ ]:
X_train, X_test, y_train, y_test = train_test_split(sample.loc[:, features], sample['churn'].astype(int),
                                                    test_size=0.3, random_state=7)
In [53]:
model = XGBClassifier(random_state = 111)
model.fit(X_train, y_train)
Out[53]:
XGBClassifier(base_score=0.5, booster='gbtree', colsample_bylevel=1,
              colsample_bynode=1, colsample_bytree=1, gamma=0,
              learning_rate=0.1, max_delta_step=0, max_depth=3,
              min_child_weight=1, missing=None, n_estimators=100, n_jobs=1,
              nthread=None, objective='binary:logistic', random_state=111,
              reg_alpha=0, reg_lambda=1, scale_pos_weight=1, seed=None,
              silent=None, subsample=1, verbosity=1)

Performance

In [54]:
calculate_performance_metrics(model, sample, X_train, X_test, y_train, y_test, 2, 'recall_2')
Train score: 0.7175601114270695
Test score: 0.7152163415105274
Confusion matrix:
[[57635 46005]
 [12968 90472]]
*** METRIC SUMMARY ***
              precision    recall  f1-score   support

           0       0.82      0.56      0.66    103640
           1       0.66      0.87      0.75    103440

    accuracy                           0.72    207080
   macro avg       0.74      0.72      0.71    207080
weighted avg       0.74      0.72      0.71    207080

accuracy 0.7152163415105274
precision 0.6629102339588355
recall 0.8746326372776488
f1 score 0.7541941588132562
ROC_AUC: 0.7923964386676858
*** REDISUAL TIME DEPENDENCY ***
*** RESIDUAL SPATIAL AUTOCORRELATION ***
/usr/local/lib/python3.6/dist-packages/carto/sql.py:229: UserWarning: Batch SQL job created with job_id: f57b48ce-ad5f-4daf-abae-32362ff39eae
  warnings.warn('Batch SQL job created with job_id: {job_id}'.format(job_id=data['job_id']))
/usr/local/lib/python3.6/dist-packages/pysal/weights/weights.py:189: UserWarning: There are 5 disconnected observations
  warnings.warn("There are %d disconnected observations" % ni)
/usr/local/lib/python3.6/dist-packages/pysal/weights/weights.py:190: UserWarning: Island ids: 104, 198, 199, 357, 387
  warnings.warn("Island ids: %s" % ', '.join(str(island) for island in self.islands))
('WARNING: ', 104, ' is an island (no neighbors)')
('WARNING: ', 198, ' is an island (no neighbors)')
('WARNING: ', 199, ' is an island (no neighbors)')
('WARNING: ', 357, ' is an island (no neighbors)')
('WARNING: ', 387, ' is an island (no neighbors)')
Moran's I: 0.241
Moran's E[I]: -0.002
Moran's Var[I]: 0.001

Feature importance

In [55]:
importances = pd.DataFrame(model.feature_importances_, index=sample.loc[:, features].columns, columns=['importance']).sort_values('importance', ascending=False)
plt.figure(figsize=(16,6))
sns.barplot(x='importance', y='index', data=importances.reset_index()[:25], color='blue')
plt.title('Feature importance', size=14)
Out[55]:
Text(0.5, 1.0, 'Feature importance')

Spatial autocorrelation visualization

Let's visualize the model's performance in the different zip codes.

In [56]:
# RESIDUAL SPATIAL AUTOCORRELATION
residual_metric = 'recall_2'

Map(
    Layer('edp_churn_residuals', '''
        color: ramp(buckets(${}, [0.65, 0.8, 0.9]), Emrld)
        strokeWidth: 0.4
        strokeColor: opacity(blue,0.2)
        order: asc(width())

    '''.format(residual_metric),
    
    popup={
            'hover': [{
                'title': 'Recall',
                'value': '$recall_2'
            },{
                'title': 'Precision',
                'value': '$precision_2'
            },{
                'title': 'Accuracy',
                'value': '$accuracy_2'
            },
                {
                'title': 'zip code',
                'value': '$postcode'
            }]
        },
     
    legend={
            'title':'{}'.format(residual_metric),
            'type': 'color-continuous-polygon',
            'prop': 'color'
        }),
    
    viewport={
        'zoom': 5.75,
        'lat': 39.5129,
        'lng': -8.0977
              },
    
    basemap = basemaps.darkmatter
  )
Out[56]:

Iteration 2

On this iteration, we introduce spatial features and perform feature selection.

Summary of results:

  • Global accuracy increases by 1 percentage point (from 0.72 to 0.73), and global and January's recall decreases by 1 percentage points. However, performance in more regular through time.
  • We can see recall is now very weakly time dependent.
  • Moran's I spatial autocorrelation index has increased to 0.45 (from 0.24). This makes sense because the spatial features are helping find local patterns.
  • If we plot the performance by zip code, we can see there are some clusters of zip codes with high performance, and others with lower performance.
In [ ]:
new_features = ['hh_t', 'hh_size', 'age_t0014', 'age_t1529', 'age_t3044', 'age_t4559',
                'age_t60pl', 'pp_ci', 'main_postal_region_1', 'main_postal_region_2',
                'main_postal_region_3', 'main_postal_region_4', 'main_postal_region_5',
                'main_postal_region_6', 'main_postal_region_7', 'main_postal_region_8',
                'main_postal_region_9', 'main_urban_type_0', 'main_urban_type_1', 'pop_density_cp_main',
                'churn_rate_6m', 'churn_trend_6m', 'cp_qtd_consumo_anual_esp_E_12m',
                'cp_qtd_consumo_anual_esp_G_12m', '1m_churn_rate_fullzc', '1m_churn_fullzc',
                'lifetime_local_churn_coeff', 'qtd_consumo_anual_coeff', 'no_churn_clients_rel',
                'no_clients_cp_main', 'no_clients_rel_cp_main',
                'qtd_consumo_anual_esp_cp_main']
In [58]:
df_final_f = df_final.loc[(df_final['year_month'] > starting_churn_date_cut) & (df_final['year_month_start'] >= starting_join_date_cut), client_features + non_spatial_features + new_features]
ending_periods = df_final_f['year_month'].sort_values(ascending=False).unique()
sample_size_f = 12
replace = sample_size_f > 1
sample = df_final_f[(df_final_f['churn']) & (df_final_f['year_month'] == ending_periods[0])].sample(frac=sample_size_f, replace=replace, random_state=111)
for i in range(1, len(ending_periods)):
  sample_size_f /= 2
  if sample_size_f < 0.05:
    sample_size_f = 0.05
  replace = sample_size_f > 1
  sample = pd.concat([sample, df_final_f[(df_final_f['churn']) & (df_final_f['year_month'] == ending_periods[i])].sample(frac=sample_size_f, replace=replace, random_state=111)])

sample = pd.concat([sample, df_final_f[~df_final_f['churn']].sample(sample.shape[0], random_state=111)])

sample.shape
Out[58]:
(690266, 73)
In [ ]:
features = ['id_conta_rgpd_count', 'id_contrato_rgpd_count', 'e_count', 'g_count',
           's_count', 'eg_count', 'lifetime_global_coeff',
           'no_active_accounts', 'no_active_contracts', 'no_products',
           'consent_data', 'apartamento',
           'moradia', 'other_dwellings', 'flg_deb_direto', 'flg_fatura_eletronica',
           'flg_edp_online', 'mean_E', 'mean_G', 'start_month', 
           'time_since_last_upgrade',
           'time_since_last_downgrade', 'solar',
           'active_contracts_rel', 'apartamento_rel', 'moradia_rel',
           'other_dwellings_rel'] + new_features

Feature selection

At this point, the number of feature is quite high and they may introduce noise. Before training the model, we perfom a recursive feature elimination. First, iterating 3 times over a RFECV process, and finally over a RFE indicating we´d like to keep 25 features, which seems a reasonable amount to later being able to have explainable results.

Spatial features selected in the process:

  • hh_size: Average home size (per zip code)
  • pop_density_cp_main: Population density (per zip code)
  • churn_rate_6m : Churn rate in the previous 6 months (per zip code)
  • churn_trend_6m: Churn trend in the previous 6 months (per zip code)
  • cp_qtd_consumo_anual_esp_E_12m: Annual expected electric consumption in the previous 12 months (per zip code)
  • lifetime_local_churn_coeff: lifetime to zip code's previous 6 month's churn average lifetime ratio
  • no_churn_clients_rel: Total average churn rate (per zip code)
In [ ]:
features = ['id_conta_rgpd_count', 'id_contrato_rgpd_count', 'e_count', 's_count', 'eg_count', 'lifetime_global_coeff', 'no_active_accounts', 'no_active_contracts', 
            'no_products', 'moradia', 'flg_deb_direto', 'flg_fatura_eletronica', 'mean_E', 'time_since_last_upgrade', 'time_since_last_downgrade', 'active_contracts_rel', 
            'apartamento_rel', 'moradia_rel', 'hh_size', 'pop_density_cp_main', 'churn_rate_6m', 'churn_trend_6m', 'cp_qtd_consumo_anual_esp_E_12m', 
            'lifetime_local_churn_coeff', 'no_churn_clients_rel']
In [61]:
'''rfecv_sample = sample.sample(int(np.ceil(sample.shape[0] * 0.1)))'''
Out[61]:
'rfecv_sample = sample.sample(int(np.ceil(sample.shape[0] * 0.1)))'
In [62]:
'''selector = RFECV(RandomForestClassifier(max_depth=6, random_state=111, n_estimators=100, min_samples_leaf = 5), step=1, cv=3, scoring='accuracy', n_jobs=8)
selector = selector.fit(rfecv_sample.loc[:, features], rfecv_sample['churn'].astype(int))'''
Out[62]:
"selector = RFECV(RandomForestClassifier(max_depth=6, random_state=111, n_estimators=100, min_samples_leaf = 5), step=1, cv=3, scoring='accuracy', n_jobs=8)\nselector = selector.fit(rfecv_sample.loc[:, features], rfecv_sample['churn'].astype(int))"
In [63]:
'''plt.figure(figsize=(12,5))
plt.xlabel("Number of features selected")
plt.ylabel("Cross validation score (accuracy)")
plt.plot(range(1, len(selector.grid_scores_) + 1), selector.grid_scores_)'''
Out[63]:
'plt.figure(figsize=(12,5))\nplt.xlabel("Number of features selected")\nplt.ylabel("Cross validation score (accuracy)")\nplt.plot(range(1, len(selector.grid_scores_) + 1), selector.grid_scores_)'
In [64]:
'''selector.n_features_'''
Out[64]:
'selector.n_features_'
In [65]:
'''# This final set of features was obtained after 3 iterations over previous selector
features = np.array(features)[selector.support_].tolist()'''
Out[65]:
'# This final set of features was obtained after 3 iterations over previous selector\nfeatures = np.array(features)[selector.support_].tolist()'
In [66]:
'''selector = RFE(RandomForestClassifier(max_depth=6, random_state=111, n_estimators=100, min_samples_leaf = 5), step=1, n_features_to_select=25)
selector = selector.fit(rfecv_sample.loc[:, features], rfecv_sample['churn'].astype(int))'''
Out[66]:
"selector = RFE(RandomForestClassifier(max_depth=6, random_state=111, n_estimators=100, min_samples_leaf = 5), step=1, n_features_to_select=25)\nselector = selector.fit(rfecv_sample.loc[:, features], rfecv_sample['churn'].astype(int))"
In [67]:
'''prev_features = features'''
Out[67]:
'prev_features = features'
In [68]:
'''features = np.array(features)[selector.support_].tolist()'''
Out[68]:
'features = np.array(features)[selector.support_].tolist()'

Model training

In [ ]:
X_train, X_test, y_train, y_test = train_test_split(sample.loc[:, features], sample['churn'].astype(int),
                                                    test_size=0.3, random_state=7)
In [70]:
model = XGBClassifier(random_state = 111)
model.fit(X_train, y_train)
Out[70]:
XGBClassifier(base_score=0.5, booster='gbtree', colsample_bylevel=1,
              colsample_bynode=1, colsample_bytree=1, gamma=0,
              learning_rate=0.1, max_delta_step=0, max_depth=3,
              min_child_weight=1, missing=None, n_estimators=100, n_jobs=1,
              nthread=None, objective='binary:logistic', random_state=111,
              reg_alpha=0, reg_lambda=1, scale_pos_weight=1, seed=None,
              silent=None, subsample=1, verbosity=1)
In [71]:
calculate_performance_metrics(model, sample, X_train, X_test, y_train, y_test, 5, 'recall_5')
Train score: 0.7292616094009347
Test score: 0.7259126907475372
Confusion matrix:
[[61149 42491]
 [14267 89173]]
*** METRIC SUMMARY ***
              precision    recall  f1-score   support

           0       0.81      0.59      0.68    103640
           1       0.68      0.86      0.76    103440

    accuracy                           0.73    207080
   macro avg       0.74      0.73      0.72    207080
weighted avg       0.74      0.73      0.72    207080

accuracy 0.7259126907475372
precision 0.677277008141937
recall 0.8620746326372777
f1 score 0.758583435415816
ROC_AUC: 0.8102242109189911
*** REDISUAL TIME DEPENDENCY ***
*** RESIDUAL SPATIAL AUTOCORRELATION ***
/usr/local/lib/python3.6/dist-packages/carto/sql.py:229: UserWarning: Batch SQL job created with job_id: 8e9ccdfa-18af-4707-813c-66a65d4e8f7d
  warnings.warn('Batch SQL job created with job_id: {job_id}'.format(job_id=data['job_id']))
/usr/local/lib/python3.6/dist-packages/pysal/weights/weights.py:189: UserWarning: There are 5 disconnected observations
  warnings.warn("There are %d disconnected observations" % ni)
/usr/local/lib/python3.6/dist-packages/pysal/weights/weights.py:190: UserWarning: Island ids: 88, 198, 199, 357, 387
  warnings.warn("Island ids: %s" % ', '.join(str(island) for island in self.islands))
('WARNING: ', 88, ' is an island (no neighbors)')
('WARNING: ', 198, ' is an island (no neighbors)')
('WARNING: ', 199, ' is an island (no neighbors)')
('WARNING: ', 357, ' is an island (no neighbors)')
('WARNING: ', 387, ' is an island (no neighbors)')
Moran's I: 0.453
Moran's E[I]: -0.002
Moran's Var[I]: 0.001

Feature importance

In [72]:
importances = pd.DataFrame(model.feature_importances_, index=sample.loc[:, features].columns, columns=['importance']).sort_values('importance', ascending=False)
plt.figure(figsize=(16,6))
sns.barplot(x='importance', y='index', data=importances.reset_index()[:25], color='blue')
plt.title('Feature importance', size=14)
Out[72]:
Text(0.5, 1.0, 'Feature importance')

Spatial autocorrelation visualization

Let's visualize the model's performance in the different zip codes.

In [73]:
# RESIDUAL SPATIAL AUTOCORRELATION
residual_metric = 'recall_5'


#color: ramp(globalQuantiles(${}, 4), Emrld)
Map(
    Layer('edp_churn_residuals', '''
        color: ramp(buckets(${}, [0.65, 0.8, 0.9]), Emrld)
        strokeWidth: 0.4
        strokeColor: opacity(blue,0.2)
        order: asc(width())

    '''.format(residual_metric),
    
    popup={
            'hover': [{
                'title': 'Recall',
                'value': '$recall_5'
            },{
                'title': 'Precision',
                'value': '$precision_5'
            },{
                'title': 'Accuracy',
                'value': '$accuracy_5'
            },
                {
                'title': 'zip code',
                'value': '$postcode'
            }]
        },
     
    legend={
            'title':'{}'.format(residual_metric),
            'type': 'color-continuous-polygon',
            'prop': 'color'
        }),
    
    viewport={
        'zoom': 5.75,
        'lat': 39.5129,
        'lng': -8.0977
              },
    
    basemap = basemaps.darkmatter
  )
Out[73]:

Iteration 3

On this iteration we optimize our algorithm performing some hyperparameter tuning.

Summary of results:

  • Global accuracy is satble at 0.73, and global and January's recall as well at 0.86 and 0.84, respectively.
  • 4 of the top 10 describing features are spatial features:
    • churn_trend_6m: Churn trend in the last 6 months in the client's zip code
    • no_churn_clients_rel: Proportion of clients that have churned throughout the company's history in the client's zip code
    • churn_rate_6m: Average churn rate in the last 6 months in the client's zip code.
    • lifetime_local_churn_coeff: ratio of client's lifetime to churners' lifetime within his/her zip code in the last 6 months.
  • Moran's I spatial autocorrelation index has slightly decreased to 0.43 (from 0.45). This makes sense because the spatial features are helping find local patterns.
  • If we plot the performance by zip code, we can see there are some clusters of zip codes with high performance, and others with lower performance.
In [74]:
df_final_f = df_final.loc[(df_final['year_month'] > starting_churn_date_cut) & (df_final['year_month_start'] >= starting_join_date_cut), client_features + non_spatial_features + new_features]
ending_periods = df_final_f['year_month'].sort_values(ascending=False).unique()
sample_size_f = 16
replace = sample_size_f > 1
sample = df_final_f[(df_final_f['churn']) & (df_final_f['year_month'] == ending_periods[0])].sample(frac=sample_size_f, replace=replace, random_state=111)
for i in range(1, len(ending_periods)):
  sample_size_f /= 2
  if sample_size_f < 0.025:
    sample_size_f = 0.025
  replace = sample_size_f > 1
  sample = pd.concat([sample, df_final_f[(df_final_f['churn']) & (df_final_f['year_month'] == ending_periods[i])].sample(frac=sample_size_f, replace=replace, random_state=111)])
  

'''initial_periods = df_final_f['year_month_start'].unique()
no_initial_periods = len(initial_periods)
sample_size = int(np.ceil(sample.shape[0] / no_initial_periods))

for initial_period in initial_periods:
  sample = pd.concat([sample, df_final_f[(~df_final_f['churn']) & (df_final_f['year_month_start'] == initial_period)].sample(sample_size, replace=True, random_state=111)])'''

sample = pd.concat([sample, df_final_f[~df_final_f['churn']].sample(sample.shape[0], random_state=111)])

sample.shape
Out[74]:
(901418, 73)

Hyperparameter optimization

In [ ]:
opt_sample = sample.sample(int(np.ceil(sample.shape[0] * 0.1)))
In [76]:
param_range = [2, 3, 4, 5]

viz = ValidationCurve(
    XGBClassifier(random_state = 111), param_name="max_depth", param_range=param_range,
    cv=3, scoring="accuracy", n_jobs=8,
)

viz.fit(opt_sample.loc[:, features], opt_sample['churn'].astype(int))
viz.poof()
In [77]:
param_range = [0.01, 0.05, 0.1, 0.15, 0.2, 0.25]

viz = ValidationCurve(
    XGBClassifier(max_depth=3, random_state = 111), param_name="learning_rate", param_range=param_range,
    cv=3, scoring="accuracy", n_jobs=8,
)

viz.fit(opt_sample.loc[:, features], opt_sample['churn'].astype(int))
viz.poof()
In [78]:
param_range = [50, 100, 150, 200, 300]

viz = ValidationCurve(
    XGBClassifier(max_depth=4, learning_rate=0.15, random_state = 111), param_name="n_estimators", param_range=param_range,
    cv=3, scoring="accuracy", n_jobs=8,
)

viz.fit(opt_sample.loc[:, features], opt_sample['churn'].astype(int))
viz.poof()

Model training

In [ ]:
X_train, X_test, y_train, y_test = train_test_split(sample.loc[:, features], sample['churn'].astype(int),
                                                    test_size=0.3, random_state=7)
In [80]:
model = XGBClassifier(random_state = 111, max_depth=3, learning_rate=0.15, n_estimators=100, base_score=0.5)
model.fit(X_train, y_train)
Out[80]:
XGBClassifier(base_score=0.5, booster='gbtree', colsample_bylevel=1,
              colsample_bynode=1, colsample_bytree=1, gamma=0,
              learning_rate=0.15, max_delta_step=0, max_depth=3,
              min_child_weight=1, missing=None, n_estimators=100, n_jobs=1,
              nthread=None, objective='binary:logistic', random_state=111,
              reg_alpha=0, reg_lambda=1, scale_pos_weight=1, seed=None,
              silent=None, subsample=1, verbosity=1)
In [81]:
calculate_performance_metrics(model, sample, X_train, X_test, y_train, y_test, 6, 'recall_6')
Train score: 0.7336479701802876
Test score: 0.7325848845894996
Confusion matrix:
[[ 81883  53098]
 [ 19218 116227]]
*** METRIC SUMMARY ***
              precision    recall  f1-score   support

           0       0.81      0.61      0.69    134981
           1       0.69      0.86      0.76    135445

    accuracy                           0.73    270426
   macro avg       0.75      0.73      0.73    270426
weighted avg       0.75      0.73      0.73    270426

accuracy 0.7325848845894996
precision 0.6864137014616861
recall 0.8581121488427037
f1 score 0.7627194277652001
ROC_AUC: 0.8224607880102875
*** REDISUAL TIME DEPENDENCY ***
*** RESIDUAL SPATIAL AUTOCORRELATION ***
/usr/local/lib/python3.6/dist-packages/carto/sql.py:229: UserWarning: Batch SQL job created with job_id: 00ef12d5-575a-49c9-a1d9-e7c90ffa3243
  warnings.warn('Batch SQL job created with job_id: {job_id}'.format(job_id=data['job_id']))
/usr/local/lib/python3.6/dist-packages/pysal/weights/weights.py:189: UserWarning: There are 5 disconnected observations
  warnings.warn("There are %d disconnected observations" % ni)
/usr/local/lib/python3.6/dist-packages/pysal/weights/weights.py:190: UserWarning: Island ids: 88, 198, 199, 358, 387
  warnings.warn("Island ids: %s" % ', '.join(str(island) for island in self.islands))
('WARNING: ', 88, ' is an island (no neighbors)')
('WARNING: ', 198, ' is an island (no neighbors)')
('WARNING: ', 199, ' is an island (no neighbors)')
('WARNING: ', 358, ' is an island (no neighbors)')
('WARNING: ', 387, ' is an island (no neighbors)')
Moran's I: 0.418
Moran's E[I]: -0.002
Moran's Var[I]: 0.001

Feature importance

In [ ]:
X = X_test
explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(X)
In [83]:
shap.summary_plot(shap_values, X, plot_type="bar")
In [84]:
shap.summary_plot(shap_values, X)

Spatial autocorrelation visualization

Let's visualize the model's performance in the different zip codes.

In [85]:
# RESIDUAL SPATIAL AUTOCORRELATION
residual_metric = 'recall_6'


#color: ramp(globalQuantiles(${}, 4), Emrld)
Map(
    Layer('edp_churn_residuals', '''
        color: ramp(buckets(${}, [0.65, 0.8, 0.9]), Emrld)
        strokeWidth: 0.4
        strokeColor: opacity(blue,0.2)
        order: asc(width())

    '''.format(residual_metric),
    
    popup={
            'hover': [{
                'title': 'Recall',
                'value': '$recall_6'
            },{
                'title': 'Precision',
                'value': '$precision_6'
            },{
                'title': 'Accuracy',
                'value': '$accuracy_6'
            },
                {
                'title': 'zip code',
                'value': '$postcode'
            }]
        },
     
    legend={
            'title':'{}'.format(residual_metric),
            'type': 'color-continuous-polygon',
            'prop': 'color'
        }),
    
    viewport={
        'zoom': 5.75,
        'lat': 39.5129,
        'lng': -8.0977
              },
    
    basemap = basemaps.darkmatter
  )
Out[85]:

4. Spatial further improvements

Here, we focus on two improvements idientified throughout the analysis related to spatial components:

  • Recall & churn rate's spatial autocorrelation
  • Region specific models to focus on local churn patterns: Algarve

4.1 Recall and churn rate's spatial autocorrelation

On the las result obtained from the optimized algorithm, we have a Moran's I value of 0.43 for the recall. Visualizing these data we can see low recall areas are clustered.

On the other hand, low churn rate zip codes are also clustered (see map below).

If we analyze the relationship between these two variables, we can see that recall is high and robust (low variance) in zip codes with a high churn rate, while it is lower and with a higher variance in zip codes with lower churn rate. Since the total average churn rate of a zip code is one of the top 5 most important features, this is one of the reasons for having cluster areas of low recall.

In [ ]:
prediction = model.predict(X_test)
In [ ]:
accuracy_cp =  calculate_metric(sample, X_test, y_test, prediction, 'cp_main', 'accuracy', np.full((X_test.shape[0],), True), y_test == prediction)
In [ ]:
accuracy_cp = accuracy_cp.merge(demog[['postcode', 'no_churn_clients_rel']], how='left', left_on='cp_main', right_on='postcode')
In [89]:
plt.figure(figsize=(12,7))
sns.regplot(accuracy_cp['no_churn_clients_rel'], accuracy_cp['accuracy'])
plt.hlines(0.65, 0.051, 0.225, 'red', '--')
Out[89]:
<matplotlib.collections.LineCollection at 0x7f226a8bacc0>
In [90]:
Map(
    Layer('edp_cp_demographics_agg', '''
        color: ramp(buckets($no_churn_clients_rel, [0.091980, 0.125496, 0.157762]), Emrld)
        strokeWidth: 0.4
        strokeColor: opacity(blue,0.2)
        order: asc(width())

    ''',
    
    popup={
            'hover': [
                {
                'title': 'zip code',
                'value': '$postcode'
            }]
        },
     
    legend={
            'title':'{}'.format(residual_metric),
            'type': 'color-continuous-polygon',
            'prop': 'color'
        }),
    
    viewport={
        'zoom': 5.75,
        'lat': 39.5129,
        'lng': -8.0977
              },
    
    basemap = basemaps.darkmatter
  )
Out[90]:

4.2 Region specific models to focus on local churn patterns: Algarve

Throughout the process of building the model, we have seen a consistent lower recall in the Southern part of Portugal (among other areas). This is because churn patterns in this region are different to other parts of Portugal, probably due to a higher share of second residences.

Here we build a model for the region of Algarve (postal codes starting by 8) and see that we get much better results than with the general model. Taking a close look at the results, we can see that the main differences wrt the general model are:

  • In Algarve, churn trend influences in the opposite way than in the rest of Prtugal: A higher churn trend in your zip code implies a lower churn risk.
  • Population density is one of the top 10 most influencial features, while in the general model is not within the 20 first.
  • A higher expected average annual electric consumption of each zip code does not imply higher churn, as it does in the general model.
  • A higher number of accounts (active and non-active) implies higher churn. This is not the case in the general model.

    As a result, we go from a performance of 0.73 in accuracy and 0.82 in AUC with the general model, to an accuracy of 0.83 and an AUC of 0.92 with the specialized model (further details below).

Perfomance with general model

In [ ]:
algarve_sample = sample[sample['cp_main'].astype(str).str[0] == '8']
In [ ]:
algarve_prediction = model.predict(algarve_sample[features])
In [ ]:
algarve_probabilities = model.predict_proba(algarve_sample[features])
In [94]:
# METRIC SUMMARY
print('*** METRIC SUMMARY ***')
print(classification_report(algarve_sample['churn'].astype(int), algarve_prediction))
print('accuracy', accuracy_score(algarve_sample['churn'].astype(int), algarve_prediction))
print('precision', precision_score(algarve_sample['churn'].astype(int), algarve_prediction))
print('recall', recall_score(algarve_sample['churn'].astype(int), algarve_prediction))
print('f1 score', f1_score(algarve_sample['churn'].astype(int), algarve_prediction))

fpr, tpr, _ = roc_curve(algarve_sample['churn'].astype(int), algarve_probabilities[:,1])
roc_auc = auc(fpr, tpr)
print('ROC_AUC:', roc_auc)
*** METRIC SUMMARY ***
              precision    recall  f1-score   support

           0       0.81      0.73      0.77     26428
           1       0.63      0.74      0.68     16780

    accuracy                           0.73     43208
   macro avg       0.72      0.73      0.73     43208
weighted avg       0.74      0.73      0.74     43208

accuracy 0.7325263840029624
precision 0.6339024765420704
recall 0.7367699642431466
f1 score 0.6814761734145467
ROC_AUC: 0.8164740037158552

Performance with specialized model

In [95]:
reg = '8'
df_final_reg = df_final.loc[(df_final['year_month'] > starting_churn_date_cut) & (df_final['year_month_start'] >= starting_join_date_cut) & (df_final['cp_main'].astype(str).str[0] == reg), 
                            client_features + non_spatial_features + new_features]
ending_periods = df_final_reg['year_month'].sort_values(ascending=False).unique()
sample_size_f = 16
replace = sample_size_f > 1
sample = df_final_reg[(df_final_reg['churn']) & (df_final_reg['year_month'] == ending_periods[0])].sample(frac=sample_size_f, replace=replace, random_state=111)
for i in range(1, len(ending_periods)):
  sample_size_f /= 2
  if sample_size_f < 0.025:
    sample_size_f = 0.025
  replace = sample_size_f > 1
  sample = pd.concat([sample, df_final_reg[(df_final_reg['churn']) & (df_final_reg['year_month'] == ending_periods[i])].sample(frac=sample_size_f, replace=replace, random_state=111)])
  

'''postal_cpdes = df_final_reg['cp_main'].unique()
for postal_code in postal_cpdes:
  sample = pd.concat([sample, df_final_reg[(~df_final_reg['churn']) & (df_final_reg['cp_main'] == postal_code)].sample(sample[sample['cp_main'] == postal_code].shape[0], random_state=111)])'''

sample = pd.concat([sample, df_final_reg[~df_final_reg['churn']].sample(sample.shape[0], random_state=111)])

sample.shape
Out[95]:
(33330, 73)
In [ ]:
X_train, X_test, y_train, y_test = train_test_split(sample.loc[:, features], sample['churn'].astype(int),
                                                    test_size=0.3, random_state=7)
In [97]:
model_algarve = XGBClassifier(random_state = 111, max_depth=3, learning_rate=0.15, n_estimators=100, base_score=0.5)
model_algarve.fit(X_train, y_train)
Out[97]:
XGBClassifier(base_score=0.5, booster='gbtree', colsample_bylevel=1,
              colsample_bynode=1, colsample_bytree=1, gamma=0,
              learning_rate=0.15, max_delta_step=0, max_depth=3,
              min_child_weight=1, missing=None, n_estimators=100, n_jobs=1,
              nthread=None, objective='binary:logistic', random_state=111,
              reg_alpha=0, reg_lambda=1, scale_pos_weight=1, seed=None,
              silent=None, subsample=1, verbosity=1)
In [98]:
calculate_performance_metrics(model_algarve, sample, X_train, X_test, y_train, y_test, 7, 'recall_7')
Train score: 0.8282542539968283
Test score: 0.8173817381738174
Confusion matrix:
[[4085  979]
 [ 847 4088]]
*** METRIC SUMMARY ***
              precision    recall  f1-score   support

           0       0.83      0.81      0.82      5064
           1       0.81      0.83      0.82      4935

    accuracy                           0.82      9999
   macro avg       0.82      0.82      0.82      9999
weighted avg       0.82      0.82      0.82      9999

accuracy 0.8173817381738174
precision 0.8067890270376948
recall 0.8283687943262411
f1 score 0.8174365126974604
ROC_AUC: 0.9123642902759571
*** REDISUAL TIME DEPENDENCY ***
*** RESIDUAL SPATIAL AUTOCORRELATION ***
/usr/local/lib/python3.6/dist-packages/carto/sql.py:229: UserWarning: Batch SQL job created with job_id: 25ad29b8-c14b-4e87-8ae5-ed84ec824a41
  warnings.warn('Batch SQL job created with job_id: {job_id}'.format(job_id=data['job_id']))
/usr/local/lib/python3.6/dist-packages/pysal/weights/weights.py:189: UserWarning: There are 5 disconnected observations
  warnings.warn("There are %d disconnected observations" % ni)
/usr/local/lib/python3.6/dist-packages/pysal/weights/weights.py:190: UserWarning: Island ids: 88, 198, 199, 358, 387
  warnings.warn("Island ids: %s" % ', '.join(str(island) for island in self.islands))
('WARNING: ', 88, ' is an island (no neighbors)')
('WARNING: ', 198, ' is an island (no neighbors)')
('WARNING: ', 199, ' is an island (no neighbors)')
('WARNING: ', 358, ' is an island (no neighbors)')
('WARNING: ', 387, ' is an island (no neighbors)')
Moran's I: 0.729
Moran's E[I]: -0.002
Moran's Var[I]: 0.001
In [ ]:
X_a = X_test
explainer = shap.TreeExplainer(model_algarve)
shap_values = explainer.shap_values(X_a)
In [100]:
shap.summary_plot(shap_values, X_a, plot_type="bar")
In [101]:
shap.summary_plot(shap_values, X_a)

5. Model in production and maintenance

Once the model is trained, the following step would be to run it every month to identify those clients at risk of churning and for which the company can still do something to retain them.

The model we have trained provides with a churning score we can use to classify clients based on this score. We could for example classify the in Low, Mid, and High churn risk, so that they can focus their actions depending on this classification.

In addition, it is very important to track the model's performace through time to know when it should be retrained. Two different policies may be followed:

  • Fix time policy: Retraining the model every 6 months, for example.
  • Metric based policy: Retraining the model every time its performance falls under a certain threshold.

This retraining is very important as churn patterns will change over time. We have actually identified to importante patterns in our data:

  • flg_cod_produto_key was a very good predictor up until the beginning of 2018.
  • flg_deb_direto has experienced a high increase in January 2019. It would be advisable to keep track of this feature because it can force us to retrain the model.

January

In [102]:
model.score(df_final.loc[df_final['year_month'] == 201901, features], df_final.loc[df_final['year_month'] == 201901, 'churn'].astype(int))
Out[102]:
0.6166390482089564
In [ ]:
final_prediction = model.predict(df_final.loc[df_final['year_month'] == 201901, features])
In [ ]:
final_probabilities = model.predict_proba(df_final.loc[df_final['year_month'] == 201901, features])
In [105]:
# METRIC SUMMARY
print('*** METRIC SUMMARY ***')
print(classification_report(df_final.loc[df_final['year_month'] == 201901, 'churn'].astype(int), final_prediction))
print('accuracy', accuracy_score(df_final.loc[df_final['year_month'] == 201901, 'churn'].astype(int), final_prediction))
print('precision', precision_score(df_final.loc[df_final['year_month'] == 201901, 'churn'].astype(int), final_prediction))
print('recall', recall_score(df_final.loc[df_final['year_month'] == 201901, 'churn'].astype(int), final_prediction))
print('f1 score', f1_score(df_final.loc[df_final['year_month'] == 201901, 'churn'].astype(int), final_prediction))

fpr, tpr, _ = roc_curve(df_final.loc[df_final['year_month'] == 201901, 'churn'].astype(int), final_probabilities[:,1])
roc_auc = auc(fpr, tpr)
print('ROC_AUC:', roc_auc)
plot_roc_curve(fpr, tpr, roc_auc)
plt.show()
*** METRIC SUMMARY ***
              precision    recall  f1-score   support

           0       1.00      0.62      0.76   2968042
           1       0.01      0.83      0.02     15470

    accuracy                           0.62   2983512
   macro avg       0.50      0.72      0.39   2983512
weighted avg       0.99      0.62      0.76   2983512

accuracy 0.6166390482089564
precision 0.011183548131627424
recall 0.8343244990303814
f1 score 0.022071246332004076
ROC_AUC: 0.8024817990797742

Explaining churn at a client level

Apart from understanding at a global level what the factors driving churn are, the techniques used allow us to understand at a client level why they churn.

Below, two examples are shown (a churner, and a non-churner) where we can see the main factors which made the algorithm classify them as churner and non-churner, respectively.

This information is very powerful because it allows them design tailored client retention strategies.

Churner

Here we can see a customer for whom the model predicts he/she will churn. On the figure below we can see the two main factors for identifying this customer as a churner are:

  • There has been a positive trend in churn for the last 6 months in the postal code where he/she lives.
  • He/she does not have a debit receipt.

The rest of the predictors are weak in determining the churn risk os this client.

In [106]:
shap.initjs()
idx_churner = 23
shap.force_plot(explainer.expected_value, shap_values[idx_churner,:], X.iloc[idx_churner,:])
Out[106]:
Visualization omitted, Javascript library not loaded!
Have you run `initjs()` in this notebook? If this notebook was from another user you must also trust this notebook (File -> Trust notebook). If you are viewing this notebook on github the Javascript has been stripped for security. If you are using JupyterLab this error is because a JupyterLab extension has not yet been written.
In [107]:
# Client characteristics
df_final.loc[2839767].to_frame().transpose()
Out[107]:
id_cliente_rgpd churn cp_main year_month_start year_month id_conta_rgpd_count id_contrato_rgpd_count e_count g_count s_count eg_count life_time_days lt_global_pop_mean_6m no_active_accounts no_active_contracts no_products consent_data main_postal_region main_urban_type apartamento moradia other_dwellings flg_deb_direto flg_fatura_eletronica flg_edp_online mean_E mean_G median_E median_G min_E min_G max_E max_G start_month flg_cod_produto_key p_t hh_t hh_size age_t0014 age_t1529 age_t3044 age_t4559 age_t60pl pp_ci no_contracts_cp_main no_contracts_rel_cp_main no_clients_cp_main no_clients_rel_cp_main median_life_time_days_cp_main qtd_consumo_anual_esp_cp_main area_cp_main pop_density_cp_main churn_rate_6m churn_trend_6m lt_local_churn_mean_12m lt_local_pop_mean_12m cp_qtd_consumo_anual_esp_E_12m cp_qtd_consumo_anual_esp_G_12m 1m_churn_rate_fullzc 1m_churn_fullzc time_since_last_upgrade time_since_last_downgrade solar no_churn_clients_rel active_accounts_rel active_contracts_rel apartamento_rel moradia_rel other_dwellings_rel lifetime_global_coeff lifetime_local_churn_coeff lifetime_coeff qtd_consumo_anual_E_12m_coeff qtd_consumo_anual_G_12m_coeff qtd_consumo_anual_coeff consent_data_0 consent_data_1 consent_data_2 main_postal_region_1 main_postal_region_2 main_postal_region_3 main_postal_region_4 main_postal_region_5 main_postal_region_6 main_postal_region_7 main_postal_region_8 main_postal_region_9 main_urban_type_0 main_urban_type_1 start_month_1 start_month_2 start_month_3 start_month_4 start_month_5 start_month_6 start_month_7 start_month_8 start_month_9 start_month_10 start_month_11 start_month_12
2839767 626047208732 False 8200 201801 201901 2 2 1 0 1 1 361 1375.99 1 1 2 0 8 0 1 0 0 1 1 1 2496 0 2496 0 2496 0 2496 0 1 1 40905 17047 2.39954 0.158416 0.172351 0.247207 0.221856 0.200171 109.713 43898 1.07317 27385 0.669478 862 1776 144.027 284.008 0.0323697 0.0223873 1085.01 1334.45 2136 0 0.00662252 2 9000 8868 0 0.0809538 0.5 0.5 1 0 0 0.262166 0.332409 0.418308 1.16799 0 1.40461 1 0 0 0 0 0 0 0 0 0 1 0 1 0 1 0 0 0 0 0 0 0 0 0 0 0
Non churner

In this case, we can see a customer for whom the model predicts he/she will not churn. On the figure below we can see the two main factors for identifying this customer as a non-churner are:

  • He/she has a debit receipt.
  • He/she has been in the company 1.73 times longer than the average client.
In [108]:
shap.initjs()
idx_non_churner = 27
shap.force_plot(explainer.expected_value, shap_values[idx_non_churner,:], X.iloc[idx_non_churner,:])
Out[108]:
Visualization omitted, Javascript library not loaded!
Have you run `initjs()` in this notebook? If this notebook was from another user you must also trust this notebook (File -> Trust notebook). If you are viewing this notebook on github the Javascript has been stripped for security. If you are using JupyterLab this error is because a JupyterLab extension has not yet been written.
In [109]:
# Client characteristics
df_final.loc[1580680].to_frame().transpose()
Out[109]:
id_cliente_rgpd churn cp_main year_month_start year_month id_conta_rgpd_count id_contrato_rgpd_count e_count g_count s_count eg_count life_time_days lt_global_pop_mean_6m no_active_accounts no_active_contracts no_products consent_data main_postal_region main_urban_type apartamento moradia other_dwellings flg_deb_direto flg_fatura_eletronica flg_edp_online mean_E mean_G median_E median_G min_E min_G max_E max_G start_month flg_cod_produto_key p_t hh_t hh_size age_t0014 age_t1529 age_t3044 age_t4559 age_t60pl pp_ci no_contracts_cp_main no_contracts_rel_cp_main no_clients_cp_main no_clients_rel_cp_main median_life_time_days_cp_main qtd_consumo_anual_esp_cp_main area_cp_main pop_density_cp_main churn_rate_6m churn_trend_6m lt_local_churn_mean_12m lt_local_pop_mean_12m cp_qtd_consumo_anual_esp_E_12m cp_qtd_consumo_anual_esp_G_12m 1m_churn_rate_fullzc 1m_churn_fullzc time_since_last_upgrade time_since_last_downgrade solar no_churn_clients_rel active_accounts_rel active_contracts_rel apartamento_rel moradia_rel other_dwellings_rel lifetime_global_coeff lifetime_local_churn_coeff lifetime_coeff qtd_consumo_anual_E_12m_coeff qtd_consumo_anual_G_12m_coeff qtd_consumo_anual_coeff consent_data_0 consent_data_1 consent_data_2 main_postal_region_1 main_postal_region_2 main_postal_region_3 main_postal_region_4 main_postal_region_5 main_postal_region_6 main_postal_region_7 main_postal_region_8 main_postal_region_9 main_urban_type_0 main_urban_type_1 start_month_1 start_month_2 start_month_3 start_month_4 start_month_5 start_month_6 start_month_7 start_month_8 start_month_9 start_month_10 start_month_11 start_month_12
1580680 152053088021 False 8900 201404 201901 1 1 1 0 0 1 1734 1375.99 1 1 1 0 8 0 1 0 0 1 1 1 3240 0 3240 0 3240 0 3240 0 4 1 18841 7561 2.49187 0.154026 0.154981 0.202802 0.202059 0.286131 90.715 18788 0.997187 13222 0.701767 809.5 1056 61.5231 306.243 0.0264577 0.049588 939.593 1194.72 1356 0 0.0352113 5 9000 8868 0 0.0710883 1 1 1 0 0 1.25927 1.84352 2.13942 2.38762 0 3.06528 1 0 0 0 0 0 0 0 0 0 1 0 1 0 0 0 0 1 0 0 0 0 0 0 0 0

February

In [110]:
df_feb = load_data('/gdrive/My Drive/final_dataframe_201902.csv', cat_vars, df_demog)
df_feb.head(2)
Out[110]:
id_cliente_rgpd churn cp_main year_month_start year_month id_conta_rgpd_count id_contrato_rgpd_count e_count g_count s_count eg_count life_time_days lt_global_pop_mean_6m no_active_accounts no_active_contracts no_products consent_data main_postal_region main_urban_type apartamento moradia other_dwellings flg_deb_direto flg_fatura_eletronica flg_edp_online mean_E mean_G median_E median_G min_E min_G max_E max_G start_month flg_cod_produto_key p_t hh_t hh_size age_t0014 age_t1529 age_t3044 age_t4559 age_t60pl pp_ci no_contracts_cp_main no_contracts_rel_cp_main no_clients_cp_main no_clients_rel_cp_main median_life_time_days_cp_main qtd_consumo_anual_esp_cp_main area_cp_main pop_density_cp_main churn_rate_6m churn_trend_6m lt_local_churn_mean_12m lt_local_pop_mean_12m cp_qtd_consumo_anual_esp_E_12m cp_qtd_consumo_anual_esp_G_12m 1m_churn_rate_fullzc 1m_churn_fullzc time_since_last_upgrade time_since_last_downgrade solar active_accounts_rel active_contracts_rel apartamento_rel moradia_rel other_dwellings_rel lifetime_global_coeff lifetime_local_churn_coeff lifetime_coeff qtd_consumo_anual_E_12m_coeff qtd_consumo_anual_G_12m_coeff qtd_consumo_anual_coeff consent_data_0 consent_data_1 consent_data_2 main_postal_region_1 main_postal_region_2 main_postal_region_3 main_postal_region_4 main_postal_region_5 main_postal_region_6 main_postal_region_7 main_postal_region_8 main_postal_region_9 main_urban_type_0 main_urban_type_1 start_month_1 start_month_2 start_month_3 start_month_4 start_month_5 start_month_6 start_month_7 start_month_8 start_month_9 start_month_10 start_month_11 start_month_12 no_churn_clients_rel
0 852564277059 False 2815 201211 201902 1 2 1 1 0 2 2254.0 1396.870286 1 2 2 0 2 1 2 0 0 1 1 0 2016.0 4368.0 2016.0 4368.0 2016.0 4368.0 2016.0 4368.0 11 1 15917.0 6061.0 2.626134 0.157065 0.143620 0.211409 0.218697 0.269209 117.480025 7891.0 0.495759 4868.0 0.305837 724.5 1884.0 6.350020 2506.606282 0.059416 0.058990 1012.208145 1322.135995 1800.0 1740.0 0.000000 0.0 2249.0 8930.0 0 1.0 1.0 1.0 0.0 0.0 1.612453 2.224617 3.106823 1.119378 2.508903 3.386737 1 0 0 0 1 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 1 0 0.168412
1 357752500269 False 6000 201410 201902 1 1 1 0 0 1 1563.0 1396.870286 1 1 1 1 6 0 0 0 1 1 1 0 552.0 0.0 552.0 0.0 552.0 0.0 552.0 0.0 10 1 42698.0 18806.0 2.270446 0.121903 0.154597 0.207972 0.218535 0.296993 104.689954 34201.0 0.800998 20453.0 0.479015 782.0 1632.0 1134.547781 37.634378 0.042441 0.063803 1166.060755 1447.050319 1800.0 1368.0 0.006711 1.0 9062.0 8930.0 0 1.0 1.0 0.0 0.0 1.0 1.118129 1.339262 1.996169 0.306496 0.000000 0.338028 0 1 0 0 0 0 0 0 1 0 0 0 1 0 0 0 0 0 0 0 0 0 0 1 0 0 0.115332
In [111]:
model.score(df_feb.loc[df_feb['year_month'] == 201902, features], df_feb.loc[df_feb['year_month'] == 201902, 'churn'].astype(int))
Out[111]:
0.5457460229565833
In [ ]:
final_prediction = model.predict(df_feb.loc[df_feb['year_month'] == 201902, features])
final_probabilities = model.predict_proba(df_feb.loc[df_feb['year_month'] == 201902, features])
In [113]:
# METRIC SUMMARY
print('*** METRIC SUMMARY ***')
print(classification_report(df_feb.loc[df_feb['year_month'] == 201902, 'churn'].astype(int), final_prediction))
print('accuracy', accuracy_score(df_feb.loc[df_feb['year_month'] == 201902, 'churn'].astype(int), final_prediction))
print('precision', precision_score(df_feb.loc[df_feb['year_month'] == 201902, 'churn'].astype(int), final_prediction))
print('recall', recall_score(df_feb.loc[df_feb['year_month'] == 201902, 'churn'].astype(int), final_prediction))
print('f1 score', f1_score(df_feb.loc[df_feb['year_month'] == 201902, 'churn'].astype(int), final_prediction))

fpr, tpr, _ = roc_curve(df_feb.loc[df_feb['year_month'] == 201902, 'churn'].astype(int), final_probabilities[:,1])
roc_auc = auc(fpr, tpr)
print('ROC_AUC:', roc_auc)
plot_roc_curve(fpr, tpr, roc_auc)
plt.show()
*** METRIC SUMMARY ***
              precision    recall  f1-score   support

           0       1.00      0.54      0.70   3015115
           1       0.01      0.70      0.02     15214

    accuracy                           0.55   3030329
   macro avg       0.50      0.62      0.36   3030329
weighted avg       0.99      0.55      0.70   3030329

accuracy 0.5457460229565833
precision 0.007679238282055232
recall 0.6978440909688445
f1 score 0.015191307887618377
ROC_AUC: 0.6617299482532395

March

In [114]:
df_mar = load_data('/gdrive/My Drive/final_dataframe_201903.csv', cat_vars, df_demog)
df_mar.head(2)
Out[114]:
id_cliente_rgpd churn cp_main year_month_start year_month id_conta_rgpd_count id_contrato_rgpd_count e_count g_count s_count eg_count life_time_days lt_global_pop_mean_6m no_active_accounts no_active_contracts no_products consent_data main_postal_region main_urban_type apartamento moradia other_dwellings flg_deb_direto flg_fatura_eletronica flg_edp_online mean_E mean_G median_E median_G min_E min_G max_E max_G start_month flg_cod_produto_key p_t hh_t hh_size age_t0014 age_t1529 age_t3044 age_t4559 age_t60pl pp_ci no_contracts_cp_main no_contracts_rel_cp_main no_clients_cp_main no_clients_rel_cp_main median_life_time_days_cp_main qtd_consumo_anual_esp_cp_main area_cp_main pop_density_cp_main churn_rate_6m churn_trend_6m lt_local_churn_mean_12m lt_local_pop_mean_12m cp_qtd_consumo_anual_esp_E_12m cp_qtd_consumo_anual_esp_G_12m 1m_churn_rate_fullzc 1m_churn_fullzc time_since_last_upgrade time_since_last_downgrade solar active_accounts_rel active_contracts_rel apartamento_rel moradia_rel other_dwellings_rel lifetime_global_coeff lifetime_local_churn_coeff lifetime_coeff qtd_consumo_anual_E_12m_coeff qtd_consumo_anual_G_12m_coeff qtd_consumo_anual_coeff consent_data_0 consent_data_1 consent_data_2 main_postal_region_1 main_postal_region_2 main_postal_region_3 main_postal_region_4 main_postal_region_5 main_postal_region_6 main_postal_region_7 main_postal_region_8 main_postal_region_9 main_urban_type_0 main_urban_type_1 start_month_1 start_month_2 start_month_3 start_month_4 start_month_5 start_month_6 start_month_7 start_month_8 start_month_9 start_month_10 start_month_11 start_month_12 no_churn_clients_rel
0 852564277059 False 2815 201211 201903 1 2 1 1 0 2 2282.0 1417.297374 1 2 2 0 2 1 2 0 0 1 1 0 2016.0 4368.0 2016.0 4368.0 2016.0 4368.0 2016.0 4368.0 11 1 15917.0 6061.0 2.626134 0.157065 0.143620 0.211409 0.218697 0.269209 117.480025 7891.0 0.495759 4868.0 0.305837 724.5 1884.0 6.350020 2506.606282 0.059799 0.006690 1045.247664 1343.455221 1800.0 1812.0 0.00 0.0 2277.0 8986.0 0 1.0 1.0 1.0 0.0 0.0 1.608971 2.181128 3.145417 1.119378 2.409266 3.386737 1 0 0 0 1 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 1 0 0.168412
1 357752500269 False 6000 201410 201903 1 1 1 0 0 1 1591.0 1417.297374 1 1 1 1 6 0 0 0 1 1 1 0 552.0 0.0 552.0 0.0 552.0 0.0 552.0 0.0 10 1 42698.0 18806.0 2.270446 0.121903 0.154597 0.207972 0.218535 0.296993 104.689954 34201.0 0.800998 20453.0 0.479015 782.0 1632.0 1134.547781 37.634378 0.043883 0.032985 1202.487805 1467.760079 1800.0 1392.0 0.02 3.0 9118.0 8986.0 0 1.0 1.0 0.0 0.0 1.0 1.121768 1.321991 2.031928 0.306496 0.000000 0.338028 0 1 0 0 0 0 0 0 1 0 0 0 1 0 0 0 0 0 0 0 0 0 0 1 0 0 0.115332
In [115]:
model.score(df_mar.loc[df_mar['year_month'] == 201903, features], df_mar.loc[df_mar['year_month'] == 201903, 'churn'].astype(int))
Out[115]:
0.5603286354100215
In [ ]:
final_prediction = model.predict(df_mar.loc[df_mar['year_month'] == 201903, features])
final_probabilities = model.predict_proba(df_mar.loc[df_mar['year_month'] == 201903, features])
In [117]:
# METRIC SUMMARY
print('*** METRIC SUMMARY ***')
print(classification_report(df_mar.loc[df_mar['year_month'] == 201903, 'churn'].astype(int), final_prediction))
print('accuracy', accuracy_score(df_mar.loc[df_mar['year_month'] == 201903, 'churn'].astype(int), final_prediction))
print('precision', precision_score(df_mar.loc[df_mar['year_month'] == 201903, 'churn'].astype(int), final_prediction))
print('recall', recall_score(df_mar.loc[df_mar['year_month'] == 201903, 'churn'].astype(int), final_prediction))
print('f1 score', f1_score(df_mar.loc[df_mar['year_month'] == 201903, 'churn'].astype(int), final_prediction))

fpr, tpr, _ = roc_curve(df_mar.loc[df_mar['year_month'] == 201903, 'churn'].astype(int), final_probabilities[:,1])
roc_auc = auc(fpr, tpr)
print('ROC_AUC:', roc_auc)
plot_roc_curve(fpr, tpr, roc_auc)
plt.show()
*** METRIC SUMMARY ***
              precision    recall  f1-score   support

           0       1.00      0.56      0.72   3024828
           1       0.01      0.68      0.02     16232

    accuracy                           0.56   3041060
   macro avg       0.50      0.62      0.37   3041060
weighted avg       0.99      0.56      0.71   3041060

accuracy 0.5603286354100215
precision 0.008160857820995852
recall 0.675086249383933
f1 score 0.016126765382642756
ROC_AUC: 0.6561762630521392

6. Future steps

  • Work together with various internal departments to add more business knowledge to the model
  • Iterate again over the sampling stage, in order to get a more representative sample of the non-churners
In [ ]: